What is Julia doing with your threads?

As documented here and in this thread, spinning up Julia with a certain number of threads, say N, and then starting N-1 tasks does not necessarily mean that all the tasks will have a thread to run on. It appears that Julia uses some of those threads for other purposes, and some of the tasks have to wait for one of the threads to become available. This obviously causes a loss of performance.

So, the question is: does anyone know what Julia would be using the threads for?

You’re on Apple silicon. Maybe they’re running on your efficiency cores rather than your big cores.

Your printed start times were much more comparable than your finishing times.

Do your start times print only after the task has actually started? If so, you can’t blame it on a task starting to run later.

1 Like

I do not know how to ask specifically for the threads to run on particular cores.
How does one do it?

This is the parallel loop:

    Threads.@sync begin
        for ch in chunks(1:count(fes), ntasks)
            @info "$(ch[2]): Started $(time() - start)"
            buffer_range, iend = _update_buffer_range(elem_mat_nrows, elem_mat_ncols, ch[1], iend)
            Threads.@spawn let r = $ch[1], b = $buffer_range
                @info "$(ch[2]): Spawned $(time() - start)"
                femm1 = FEMMHeatDiff(IntegDomain(subset(fes, r), GaussRule(3, 3)), material)
                _a = _task_local_assembler(a, b)
                @info "$(ch[2]): Started conductivity $(time() - start)"
                conductivity(femm1, _a, geom, Temp)
                @info "$(ch[2]): Finished $(time() - start)"
            end
        end
    end

I do not quite understand “you can’t blame it on a task starting to run later”. Could you elaborate?

On Linux, there is ThreadPinning. I am not aware of any macOS solutions.
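For reference, this is roughly what pinning looks like with ThreadPinning.jl on Linux (a minimal sketch; the :cores strategy pins the Julia threads one-to-one onto physical cores):

    using ThreadPinning

    pinthreads(:cores)   # pin Julia threads to distinct physical cores
    threadinfo()         # visualize which core each Julia thread sits on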

Could you elaborate?

It appears that Julia uses some of those threads for other purposes, and some of the tasks have to wait for one of the threads to become available.

Here, you made it sound as if these tasks were waiting for a thread to become available before they could begin.
Of course, it is possible for conductivity to yield, after which the tasks could get blocked. But at least they were able to begin running relatively quickly.
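To illustrate the distinction with a toy example (not your code): a task can begin promptly and still be suspended later at any yield point inside its body:

    using Base.Threads

    t = Threads.@spawn begin
        @info "task began"   # logging is itself a yield point
        yield()              # explicit yield: the scheduler may switch tasks here
        @info "task resumed"
    end
    wait(t)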

1 Like

conductivity does nothing fancy. It is serial code.
As I am showing, in Strategy A some tasks do not run as fast as others. There is a fast group and a slow group of tasks: the slow ones appear to be waiting for some of the fast tasks to finish and release a computing thread.

If you have another explanation, I am all ears…

macOS simply doesn’t allow users to pin threads. If it did, I’d immediately add support for it to ThreadPinning.jl. That’s probably never going to happen, though, given that Apple has even removed thread-affinity control features in the past.

2 Likes

Serial codes can do fancy things with side effects as well.

As I mentioned before in the other threads, you should try to reduce the complexity of your example. This process, while nontrivial and potentially time-consuming, will likely give you more insight than just benchmarking and guessing. Also, as I mentioned before as well, your @threads-based variant seems to work fine, so I recommend you investigate what is different between the two implementations. (Afaics, both variants spawn a task per thread and try to do the same work; see the sketch below.)
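For reference, a hypothetical sketch of what such an @threads-based variant could look like (work, chunks, and the sizes here are placeholders, not your code):

    using Base.Threads

    work(r) = sum(exp(-float(j - first(r))) for j in r)   # placeholder kernel

    nchunks = 4
    N = 1_000_000
    chunks = [(i - 1) * N + 1 : i * N for i in 1:nchunks]
    results = Vector{Float64}(undef, nchunks)   # preallocated, one slot per chunk

    # @threads schedules the iterations statically across the default pool
    Threads.@threads for i in 1:nchunks
        results[i] = work(chunks[i])
    end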

I guess another thing you could try is using Tracy to profile the C runtime.

Btw, although it might not be an issue in your case, I would try to benchmark without parallel printing from each task, to rule out IO interference.
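One way to do that, sketched under the assumption of a loop like yours (names are placeholders): record timestamps into preallocated slots inside the tasks and print only after the @sync block completes:

    using Base.Threads

    work(r) = sum(exp(-float(j - first(r))) for j in r)   # placeholder kernel

    nchunks = 4
    N = 1_000_000
    chunks = [(((i - 1) * N + 1 : i * N), i) for i in 1:nchunks]
    spawned = zeros(nchunks)
    finished = zeros(nchunks)
    start = time()
    @sync for (r, i) in chunks
        Threads.@spawn begin
            spawned[i] = time() - start    # no printing inside the tasks
            work(r)
            finished[i] = time() - start
        end
    end
    for i in 1:nchunks                     # report only after all tasks are done
        println("$i: spawned $(spawned[i]), finished $(finished[i])")
    end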

3 Likes

Note that things work fine with Strategy B even with all the printing, so I doubt IO is an issue.

I think there is already something to go on as far as evidence for tasks waiting unnecessarily in Strategy A: the fast and slow groups indicate that something causes threads to be unavailable, so that tasks cannot run. I do not have any insight into what that something could be. Probably some internals of the Julia libraries? Hopefully someone here knows…?

OK, here is an MWE:

module mwe_tasks
using Base.Threads
function work(r)
    s = 0.0
    for j in r
        s = s + exp(j^2)
    end
    s
end
nchunks = 4
N = 100000000
chunks = [(((i-1)*N+1:i*N), i) for i in 1:nchunks]
s = Float64[]
start = time()
Threads.@sync begin
    for ch in chunks
        @info "$(ch[2]): Started $(time() - start)"
        Threads.@spawn let r = $ch[1], i = $ch[2]
            @info "$(i): Spawned $(time() - start)"
            push!(s, work(r))
            @info "$(i): Finished $(time() - start)"
        end
    end
end
end

I start Julia with 5 threads and run as shown with four tasks. Repeat a few times, and suddenly there is again a group of fast tasks and a group of slow tasks:

julia> include(raw"mwe_tasks.jl")
WARNING: replacing module mwe_tasks.
[ Info: 1: Started 6.00004e-03
[ Info: 2: Started 6.99997e-03
[ Info: 3: Started 8.00014e-03
[ Info: 4: Started 8.00014e-03
[ Info: 1: Spawned 4.30000e-02
[ Info: 3: Spawned 4.60000e-02
[ Info: 2: Spawned 4.90000e-02
[ Info: 4: Spawned 5.30000e-02
[ Info: 1: Finished 1.33700e+00
[ Info: 3: Finished 1.33900e+00
[ Info: 4: Finished 1.33900e+00
[ Info: 2: Finished 1.34000e+00
Main.mwe_tasks

julia> include(raw"mwe_tasks.jl")
WARNING: replacing module mwe_tasks.
[ Info: 1: Started 6.00004e-03
[ Info: 2: Started 7.00021e-03
[ Info: 3: Started 8.00014e-03
[ Info: 4: Started 8.00014e-03
[ Info: 1: Spawned 7.60000e-02
[ Info: 4: Spawned 8.00002e-02
[ Info: 2: Spawned 8.50000e-02
[ Info: 3: Spawned 9.20000e-02
[ Info: 4: Finished 1.34300e+00
[ Info: 1: Finished 1.34300e+00
[ Info: 3: Finished 2.59600e+00
[ Info: 2: Finished 2.59700e+00
Main.mwe_tasks

julia> 

What is the output of versioninfo()?

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 Ă— Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, icelake-client)
  Threads: 6 on 8 virtual cores

Similar results also on:

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 24 Ă— Apple M2 Ultra
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 6 on 16 virtual cores

Well, your CPU has only 4 cores… I would assume one core is needed for the OS, so you have 3 cores that can do useful work in Julia… Don’t be surprised by strange behavior when you oversubscribe your CPU… While hyperthreading helps in rare cases, in practice it often makes things worse, in particular if you try to achieve reproducible timings…

See edit above. Have you tried it yourself?

Yes, I tried it myself. I get reproducible results:

julia> include("mwe_tasks.jl")
WARNING: replacing module mwe_tasks.
[ Info: 1: Started 0.003551006317138672
[ Info: 2: Started 0.003679037094116211
[ Info: 3: Started 0.0037190914154052734
[ Info: 4: Started 0.003751993179321289
[ Info: 1: Spawned 0.020360946655273438
[ Info: 4: Spawned 0.022195100784301758
[ Info: 2: Spawned 0.0240170955657959
[ Info: 3: Spawned 0.025805950164794922
[ Info: 2: Finished 0.78507399559021
[ Info: 3: Finished 0.7867310047149658
[ Info: 1: Finished 0.7918789386749268
[ Info: 4: Finished 0.7983829975128174
Main.mwe_tasks

The first run needs 1s due to compilation overhead, any following run needs 0.81s ±0.02s.

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 Ă— AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 47 on 32 virtual cores
Environment:
  LD_LIBRARY_PATH = /lib:/usr/lib:/usr/local/lib

The experiment requires multiple attempts. Please run it a few times.

I have now run it 50 times. Consistent results, as reported before…

Your fast CPU has 16 next-generation high-performance cores and eight next-generation high-efficiency cores. This means you never know whether your threads are running on the fast or on the slow cores.
See also: How to bind threads to performance… | Apple Developer Forums

My CPU has 16 fast cores…

Good point about the various types of cores on the mac.

Here is another series of measurements, though (making it a little easier to run a series of trials):

module mwe_tasks
using Base.Threads
function work(r)
    s = 0.0
    for j in r
        s = s + exp(-(j-minimum(r))^2)
    end
    s
end
function test()
    nchunks = 5
    N = 100000000
    chunks = [(((i-1)*N+1:i*N), i) for i in 1:nchunks]
    s = Float64[]
    start = time()
    Threads.@sync begin
        for ch in chunks
            @info "$(ch[2]): Started $(time() - start)"
            Threads.@spawn let r = $ch[1], i = $ch[2]
                @info "$(i): Spawned $(time() - start)"
                push!(s, work(r))
                @info "$(i): Finished $(time() - start)"
            end
        end
    end
    @info "Finished $(time() - start)"
    # @show s
end
end
using Main.mwe_tasks
ts = []
for n in 1:50
    push!(ts, @elapsed mwe_tasks.test())
end
@show extrema(ts)

On this machine

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 Ă— AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 6 on 32 virtual cores

I get extrema(ts) = (7.40846e-01, 1.47955e+00). So clearly the same behavior is reproduced.

Your new test program gives me, when Julia is launched with -t 5:

extrema(ts) = (0.743552106, 2.155290662)
(0.743552106, 2.155290662)

when launched with -t auto:

extrema(ts) = (0.741168447, 0.828413756)
(0.741168447, 0.828413756)

on an AMD Ryzen 7950X.

Interestingly, I can reproduce the issue with -t 16, but not with -t auto.

Also no problem if Julia is launched with -t 5,1.

Conclusion: the problem is solved by setting the size of the interactive thread pool to one.
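For anyone following along, the pool sizes can be checked directly (Julia ≥ 1.9):

$ julia -t 5,1   # 5 default threads plus 1 interactive thread

julia> Threads.nthreads(:default), Threads.nthreads(:interactive)
(5, 1)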

Can you try to reproduce these results?

1 Like

I can replicate this as well on my laptop with Julia 1.10.0 and an AMD Ryzen 7 4800H:

  • -t 5 gives extrema(ts) = (1.126322503, 2.474583519) → Sometimes slow and no interactive threads
  • -t 5,1 gives extrema(ts) = (1.130685362, 1.339060755) → never slow and 1 interactive thread
  • -t auto (equivalent to -t 16,0 for me) gives:
    extrema(ts) = (1.148305388, 2.280251291) → sometimes slow and no interactive threads

I wonder if this is loosely related to Bug in sleep() function - main thread work affecting sleep duration on running tasks.
Here is a wild guess (as I really don’t know anything about the implementation of Tasks and scheduling): Julia needs to schedule each task on some thread. If there are no interactive threads, Julia’s “main thread” is in the same threadpool that works on the tasks. So maybe sometimes a Task gets scheduled on the main thread and starts running before all Tasks were scheduled, and so some Tasks are scheduled late. This would not happen if there is at least a single interactive thread, because the main thread is then always in the interactive pool[1] while the tasks are scheduled to run in the :default pool.
I modified the example above to schedule the tasks on the :interactive pool instead, and then the slowdown occurs again. For julia -t 5,5 (so 5 interactive threads and 5 normal ones) I again get extrema(ts) = (1.117211121, 2.510383934).
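Concretely, the modification amounts to passing the pool symbol to @spawn (a sketch based on the MWE above; Julia ≥ 1.9 syntax):

    # schedule each task on the :interactive pool instead of :default
    Threads.@spawn :interactive let r = $ch[1], i = $ch[2]
        @info "$(i): Spawned $(time() - start)"
        push!(s, work(r))
        @info "$(i): Finished $(time() - start)"
    end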

EDIT: I found another thread that (likely) notices this problem: With julia-1.9, should the main task block :interactive tasks? - #2 by samtkaplan


  1. This can be verified by starting a session with or without interactive threads and just running Threads.threadpool() to see the current threadpool. It is :interactive if there are interactive threads and :default otherwise. ↩︎
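For example:

$ julia -t 5
julia> Threads.threadpool()
:default

$ julia -t 5,1
julia> Threads.threadpool()
:interactive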

2 Likes