Task scheduling puzzle

I’m seeing somewhat puzzling behaviour with task scheduling. Here is my example:

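using BenchmarkTools
using Base.Threads

# Iterate the logistic map n times, yielding to the scheduler every 1024 iterations
function work(n)
    x = pi/4
    for i in 1:n
        x *= 4*(1-x)
        (i & 0x3ff == 0) && yield()
    end
    return x
end

# Spawn one task per thread and wait for all of them to finish
function run()
    @sync for i in 1:nthreads()
        @spawn work(10_000_000)
    end
end

@btime run()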

I run it with a varying number of threads on a 12-core AMD machine:

$ for i in $(seq 1 12); do echo -n $i' '; JULIA_EXCLUSIVE=1 julia -t $i threads.jl; done
1   23.743 ms (14 allocations: 928 bytes)
2   24.579 ms (19 allocations: 1.39 KiB)
3   25.206 ms (24 allocations: 1.88 KiB)
4   25.668 ms (29 allocations: 2.36 KiB)
5   26.727 ms (34 allocations: 2.84 KiB)
6   27.717 ms (39 allocations: 3.33 KiB)
7   52.496 ms (44 allocations: 3.81 KiB)
8   144.130 ms (49 allocations: 4.30 KiB)
9   190.985 ms (55 allocations: 5.14 KiB)
10   252.950 ms (60 allocations: 5.62 KiB)
11   318.004 ms (65 allocations: 6.11 KiB)
12   402.408 ms (70 allocations: 6.59 KiB)

I can understand some increased time with more threads, but this is quite massive. Without the yield(), the time is constant, around 20 ms. So, what is going on here?
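
For reference, here is a minimal sketch of how the two variants (with and without yield()) could be compared in a single run. The names work_opt and run_opt and the do_yield flag are made up for this illustration and are not part of the script above. It uses @elapsed instead of @btime, so the timings will be noisier than the numbers I quote.

```julia
using Base.Threads

# Hypothetical variant of work() with a switch for the yield() call
function work_opt(n, do_yield::Bool)
    x = pi/4
    for i in 1:n
        x *= 4*(1-x)
        # Yield every 1024 iterations, but only when do_yield is set
        do_yield && (i & 0x3ff == 0) && yield()
    end
    return x
end

# Hypothetical variant of run(): one task per thread, wait for all of them
function run_opt(do_yield::Bool)
    @sync for i in 1:nthreads()
        @spawn work_opt(10_000_000, do_yield)
    end
end

# Warm up so that compilation is not included in the timings
run_opt(true)
run_opt(false)

t_yield   = @elapsed run_opt(true)
t_noyield = @elapsed run_opt(false)
println(nthreads(), "  with yield: ", t_yield, " s   without yield: ", t_noyield, " s")
```

It can be started the same way as the original script, e.g. with julia -t 8, to see both timings side by side for a given thread count.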

On 1.10.2 I get the following output:

$ for i in $(seq 1 12); do echo -n $i' '; JULIA_EXCLUSIVE=1 /snap/bin/julia -t $i threads.jl; done
1   23.372 ms (14 allocations: 928 bytes)
2   24.328 ms (19 allocations: 1.39 KiB)
3   25.185 ms (24 allocations: 1.88 KiB)
4   25.810 ms (29 allocations: 2.36 KiB)
5   26.423 ms (34 allocations: 2.84 KiB)
6   27.644 ms (39 allocations: 3.33 KiB)
7   41.494 ms (44 allocations: 3.81 KiB)
8   123.716 ms (49 allocations: 4.30 KiB)
9   50.254 ms (55 allocations: 5.11 KiB)
10   55.592 ms (60 allocations: 5.59 KiB)
11   60.642 ms (65 allocations: 6.08 KiB)
12   74.447 ms (70 allocations: 6.56 KiB)

I’ve run this several times and it’s roughly the same every time, so the behaviour is not due to other activity on the box.

Julia Version 1.12.0-DEV.317
Commit 0e28cf6abf* (2024-04-08 10:46 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen Threadripper PRO 5945WX 12-Cores
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 24 default, 0 interactive, 24 GC (on 24 virtual cores)