I'm seeing some puzzling behaviour with task scheduling. Here is my example (threads.jl):
using BenchmarkTools
using Base.Threads
function work(n)
    x = pi/4
    for i in 1:n
        x *= 4*(1-x)
        (i & 0x3ff == 0) && yield()
    end
    return x
end

function run()
    @sync for i in 1:nthreads()
        @spawn work(10_000_000)
    end
end

@btime run()
I run it with a varying number of threads on a 12-core AMD machine:
$ for i in $(seq 1 12); do echo -n $i' '; JULIA_EXCLUSIVE=1 julia -t $i threads.jl; done
1 23.743 ms (14 allocations: 928 bytes)
2 24.579 ms (19 allocations: 1.39 KiB)
3 25.206 ms (24 allocations: 1.88 KiB)
4 25.668 ms (29 allocations: 2.36 KiB)
5 26.727 ms (34 allocations: 2.84 KiB)
6 27.717 ms (39 allocations: 3.33 KiB)
7 52.496 ms (44 allocations: 3.81 KiB)
8 144.130 ms (49 allocations: 4.30 KiB)
9 190.985 ms (55 allocations: 5.14 KiB)
10 252.950 ms (60 allocations: 5.62 KiB)
11 318.004 ms (65 allocations: 6.11 KiB)
12 402.408 ms (70 allocations: 6.59 KiB)
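To put these numbers in perspective, here is a rough back-of-the-envelope estimate of what each `yield()` costs at 12 threads. It is only a sketch: it assumes that all of the extra time over the 1-thread run is spent in `yield()`, which the measurements do not prove. The variable names are mine; the timings are the ones listed above.

```julia
# Each task yields whenever the low 10 bits of i are zero, i.e. every 1024 iterations.
n = 10_000_000
yields_per_task = n >> 10            # number of multiples of 1024 in 1:n

# Timings copied from the table above (milliseconds).
t1_ms  = 23.743                      # 1 thread
t12_ms = 402.408                     # 12 threads

# Assumption: the whole difference is yield overhead, spread evenly over the yields.
extra_us_per_yield = (t12_ms - t1_ms) * 1000 / yields_per_task

println("yields per task:      ", yields_per_task)
println("extra time per yield: ", round(extra_us_per_yield; digits=1), " µs")
```

Under that assumption, each of the 9765 yields per task costs roughly 39 µs of wall time at 12 threads, which is a lot for what should be a quick trip through the scheduler.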
I can understand the time increasing somewhat with more threads, but this is a massive jump: from about 28 ms at 6 threads to about 400 ms at 12. Without the yield(), the time is constant at around 20 ms. So, what is going on here?
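For anyone who wants to reproduce the with/without comparison in a single session, something like the following should work. It is a minimal sketch, not the script I benchmarked: `work_variant`, `run_variant` and the `do_yield` keyword are names made up for this illustration, and it uses `@elapsed` after a warm-up call instead of `@btime`, so the numbers will be noisier.

```julia
using Base.Threads

# Same loop as work(), with the yield made optional (hypothetical keyword `do_yield`).
function work_variant(n; do_yield::Bool=true)
    x = pi/4
    for i in 1:n
        x *= 4*(1-x)
        do_yield && (i & 0x3ff == 0) && yield()
    end
    return x
end

# Spawn one task per thread, as in run(), and wait for all of them.
function run_variant(n; do_yield::Bool=true)
    @sync for _ in 1:nthreads()
        @spawn work_variant(n; do_yield)
    end
end

for flag in (true, false)
    run_variant(1_000; do_yield=flag)                      # warm-up, triggers compilation
    t = @elapsed run_variant(10_000_000; do_yield=flag)    # single timed run
    println("threads=", nthreads(), " do_yield=", flag, ": ",
            round(t * 1000; digits=1), " ms")
end
```

Run it with, e.g., `julia -t 12` and compare the two printed lines. The yield does not change the computed value, only when the task gives up its thread.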
On 1.10.2 I get the following output:
$ for i in $(seq 1 12); do echo -n $i' '; JULIA_EXCLUSIVE=1 /snap/bin/julia -t $i threads.jl; done
1 23.372 ms (14 allocations: 928 bytes)
2 24.328 ms (19 allocations: 1.39 KiB)
3 25.185 ms (24 allocations: 1.88 KiB)
4 25.810 ms (29 allocations: 2.36 KiB)
5 26.423 ms (34 allocations: 2.84 KiB)
6 27.644 ms (39 allocations: 3.33 KiB)
7 41.494 ms (44 allocations: 3.81 KiB)
8 123.716 ms (49 allocations: 4.30 KiB)
9 50.254 ms (55 allocations: 5.11 KiB)
10 55.592 ms (60 allocations: 5.59 KiB)
11 60.642 ms (65 allocations: 6.08 KiB)
12 74.447 ms (70 allocations: 6.56 KiB)
I’ve run this several times and the results are roughly the same every time, so the behaviour is not due to other activity on the box.
Julia Version 1.12.0-DEV.317
Commit 0e28cf6abf* (2024-04-08 10:46 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen Threadripper PRO 5945WX 12-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 24 default, 0 interactive, 24 GC (on 24 virtual cores)