As documented here and in this thread, spinning up Julia with a certain number of threads, let us say N, and then starting (N-1) tasks does not necessarily mean that all the tasks will have a thread to run on. It appears that Julia uses some of those threads for other purposes, and some of the tasks have to wait for one of the threads to become available. This obviously causes a loss of performance.
So, the question is: Does anyone know what Julia would be using the threads for?
On Linux, there is ThreadPinning. I am not aware of any macOS solution.
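For reference, a minimal sketch of what that looks like on Linux with ThreadPinning.jl (assuming the package is installed; `pinthreads` and `threadinfo` are from its documented API):

```julia
# Linux-only: pin Julia's threads to physical cores so the OS scheduler
# cannot migrate them between cores in the middle of a benchmark.
using ThreadPinning  # assumes `] add ThreadPinning` was done beforehand

pinthreads(:cores)   # one Julia thread per physical core, in order
threadinfo()         # print the current thread-to-core mapping
```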
Could you elaborate?
It appears that julia uses some of those threads for other purposes, and some of the tasks have to wait for one of the threads to become available.
Here, you made it sound like these tasks are waiting on a blocked task before they can begin.
Of course, it is possible for conductivity to yield, and then the tasks get blocked. But at least they were able to begin running relatively quickly.
conductivity does nothing fancy. It is a serial code.
As I am showing, in Strategy A some tasks do not run as fast as others. There is a fast group and a slow group of tasks: the slow ones appear to be waiting for some of the fast tasks to finish and release a computing thread.
macOS simply doesn’t allow users to pin threads. If it did, I’d immediately add support for it to ThreadPinning.jl. It’s probably never going to happen, though, given that Apple has even removed thread-affinity control features in the past.
Serial codes can do fancy things with side effects as well.
As I mentioned before in the other threads, you should try to reduce the complexity of your example. This process, while nontrivial and potentially time-consuming, will likely give you more insight than just benchmarking and guessing. Also, as I mentioned before, your @threads-based variant seems to work fine, so I recommend you investigate what’s different between the two implementations. (Afaics, both variants spawn a task per thread and try to do the same work.)
I guess another thing you could try is to use Tracy to profile the C runtime.
Note that things work fine with Strategy B even with all the printing.
So I doubt IO is an issue.
I think there is already something to go on as far as evidence for tasks waiting unnecessarily in Strategy A: the fast and slow groups indicate that something causes threads to be unavailable so that tasks cannot run. I do not have any insight into what that something could be.
Probably some internals of the Julia libraries? Hopefully someone here knows…?
module mwe_tasks
using Base.Threads
function work(r)
    s = 0.0
    for j in r
        s = s + exp(j^2)
    end
    s
end
nchunks = 4
N = 100000000
chunks = [(((i - 1) * N + 1:i * N), i) for i in 1:nchunks]
s = Vector{Float64}(undef, nchunks)  # preassigned slots: concurrent push! is not thread-safe
start = time()
Threads.@sync begin
    for ch in chunks
        @info "$(ch[2]): Started $(time() - start)"
        Threads.@spawn let r = $ch[1], i = $ch[2]
            @info "$(i): Spawned $(time() - start)"
            s[i] = work(r)
            @info "$(i): Finished $(time() - start)"
        end
    end
end
end
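As a quick sanity check before timing anything (a small sketch, not part of the MWE itself), it is worth confirming how many threads the process actually got:

```julia
# Confirm the thread count the process was started with, e.g. `julia -t 5`.
using Base.Threads

@show Threads.nthreads()   # number of threads available for running tasks
@show Threads.threadid()   # id of the thread running this code (typically 1)
```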
I start julia with 5 threads and run as shown with four tasks. Repeat a few times, and suddenly there is again a group of fast tasks and a group of slow tasks:
Well, your CPU has only 4 cores… I would assume one core is needed for the OS, so you have 3 cores that can do useful work in Julia. Don’t be surprised by strange behavior when you oversubscribe your CPU. While hyperthreading helps in rare cases, in practice it often makes things worse, in particular if you are trying to achieve reproducible timings.
Was running it 50 times now. Consistent results as reported before…
Your fast CPU has 16 next-generation high-performance cores and 8 next-generation high-efficiency cores. This means you never know whether your threads are running on the fast or on the slow cores.
See also: How to bind threads to performance… | Apple Developer Forums
Good point about the various types of cores on the mac.
Here is another series of measurements though (making it a little bit easier to run a series of trials):
module mwe_tasks
using Base.Threads
function work(r)
    s = 0.0
    for j in r
        s = s + exp(-(j - minimum(r))^2)
    end
    s
end
function test()
    nchunks = 5
    N = 100000000
    chunks = [(((i - 1) * N + 1:i * N), i) for i in 1:nchunks]
    s = Vector{Float64}(undef, nchunks)  # preassigned slots: concurrent push! is not thread-safe
    start = time()
    Threads.@sync begin
        for ch in chunks
            @info "$(ch[2]): Started $(time() - start)"
            Threads.@spawn let r = $ch[1], i = $ch[2]
                @info "$(i): Spawned $(time() - start)"
                s[i] = work(r)
                @info "$(i): Finished $(time() - start)"
            end
        end
    end
    @info "Finished $(time() - start)"
    # @show s
end
end
using Main.mwe_tasks

ts = Float64[]
for n in 1:50
    push!(ts, @elapsed mwe_tasks.test())
end
@show extrema(ts)
On this machine
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 6 on 32 virtual cores
I get extrema(ts) = (7.40846e-01, 1.47955e+00). So clearly the same behavior is reproduced.
I can replicate this as well on my laptop with Julia 1.10.0 and an AMD Ryzen 7 4800H
-t 5 gives extrema(ts) = (1.126322503, 2.474583519) → Sometimes slow and no interactive threads
-t 5,1 gives extrema(ts) = (1.130685362, 1.339060755) → never slow and 1 interactive thread
-t auto (equivalent to -t 16,0 for me) gives: extrema(ts) = (1.148305388, 2.280251291) → sometimes slow and no interactive threads
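To see how a given `-t M,N` flag was split between the two pools (Julia ≥ 1.9; a quick sketch):

```julia
# Query the size of each threadpool and the pool of the current task.
using Base.Threads

@show Threads.nthreads(:interactive)  # N from -t M,N; 0 if none requested
@show Threads.nthreads(:default)      # M worker threads
@show Threads.threadpool()            # pool the current task runs in
```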
I wonder if this is loosely related to Bug in sleep() function - main thread work affecting sleep duration on running tasks.
Here is a wild guess (as I really don’t know anything about the implementation of Tasks and scheduling): Julia needs to schedule the task on some thread. If there are no interactive threads, Julia’s “main thread” is in the same threadpool that works on the tasks. So maybe sometimes a Task gets scheduled on the main thread and starts running before all Tasks were scheduled, and so some Tasks are scheduled late. This would not happen if there is at least a single interactive thread, because the main thread is always in the interactive pool[1] and the tasks are scheduled to run in the :default pool.
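If that guess is right, tasks explicitly spawned onto the :default pool stay away from the pool the main thread lives in when an interactive thread is present. A small sketch of the pool-targeting syntax (Julia ≥ 1.9; not taken from the MWE):

```julia
# Spawn a task into a specific threadpool and report where it actually ran.
using Base.Threads

t = Threads.@spawn :default Threads.threadpool()
@show fetch(t)   # :default
```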
I modified the example above to schedule the tasks on the :interactive pool instead, and then the slowdown occurs again. For julia -t 5,5 (so 5 interactive threads and 5 normal ones) I again get extrema(ts) = (1.117211121, 2.510383934).