I have a function that does in-place operations on BigInts (so almost no Julia-side allocations in the @time report, see below). With --threads=8 it is about twice as slow as with --threads=1. Note: it seems fine without :greedy.
Does anyone know what causes the slowdown in this example?
For the future, do you have tips on how I can debug examples like this myself?
Thanks.
# crt.jl
using Base.Threads

function crt(x, y, z)
    M = BigInt(2)^3000 + 3
    Threads.@threads :greedy for i in 1:length(x)
        Base.GMP.MPZ.mul!(x[i], y[i], z[i])
        Base.GMP.MPZ.fdiv_q!(x[i], z[i])
        Base.GMP.MPZ.fdiv_r!(x[i], M)
    end
end
M = 256*256
x = [BigInt() for _ in 1:M]
y = [rand(1:BigInt(2)^3000) for _ in 1:M]
z = [rand(1:BigInt(2)^3000) for _ in 1:M]
@time crt(x, y, z)
@time crt(x, y, z)
Running:
$ julia --threads=1 crt.jl
0.662514 seconds (334.77 k allocations: 58.987 MiB, 1.64% gc time, 34.27% compilation time)
0.492946 seconds (65.06 k allocations: 1018.570 KiB)
$ julia --threads=8 crt.jl
1.180754 seconds (335.61 k allocations: 59.028 MiB, 4.08% gc time, 11989 lock conflicts, 81.44% compilation time)
0.941041 seconds (65.18 k allocations: 1.002 MiB, 11097 lock conflicts)
Running on:
julia> versioninfo()
Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, skylake-avx512)
Threads: 1 default, 0 interactive, 1 GC (on 24 virtual cores)
Environment:
LD_LIBRARY_PATH = /sw/lib:
My system is different, but I get a ~4x speedup, even with 30k+ lock conflicts. I'm not really sure what the lock conflicts are: the iterations write to different elements, and moving the M instance into the loop doesn't make a significant difference. (Minor catch: if the code had been written as BigInt(2^3000 + 3), that would be the same as BigInt(0 + 3), because 2^3000 overflows Int; BigInt(2)^3000 + 3 is the correct form.)
@threads :greedy works by spawning a task that puts all the elements into a zero-capacity channel, while also spawning Threads.threadpoolsize() tasks that take from the channel and run the loop body on each element.
There is therefore overhead associated with:
Creating the 9 tasks, and destroying them again
Lock contention when 8 readers continuously take from the same channel and are repeatedly woken up
There also seems to be one allocation per element, which appears to originate in the scheduler itself, invoked when each worker thread calls take! on the channel. Allocations interact poorly with multiple threads in Julia, since they cause frequent GC runs that must be coordinated between the threads (adding overhead)
Your code takes less than 5 microseconds per iteration, so the overhead swamps the computation time.
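For intuition, the mechanism can be sketched by hand. This is a simplified model of what :greedy does (the names greedy_foreach and acc are made up for illustration), not the actual implementation:

```julia
using Base.Threads

# Simplified model of the :greedy schedule: one producer feeds elements
# into a zero-capacity Channel, and threadpoolsize() worker tasks take
# from it. Every put!/take! pair goes through the channel's lock, which
# is where the reported lock conflicts and per-element overhead come from.
function greedy_foreach(f, xs)
    ch = Channel{eltype(xs)}(0)  # zero capacity: each put! must rendezvous with a take!
    producer = Threads.@spawn begin
        for x in xs
            put!(ch, x)          # blocks until some worker is ready to take
        end
        close(ch)
    end
    workers = map(1:Threads.threadpoolsize()) do _
        Threads.@spawn begin
            for x in ch          # take! under the channel lock, once per element
                f(x)
            end
        end
    end
    foreach(wait, workers)
    wait(producer)
    return nothing
end

acc = zeros(Int, 100)
greedy_foreach(i -> acc[i] = i^2, 1:100)
```

Every element individually crosses that lock, so with a loop body this cheap the synchronization dominates.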
What could be done about this?
Use a different scheduler. The greedy scheduler is useful when you have longer-running tasks (milliseconds or longer), where the runtime per element varies.
Maybe the scheduler itself could stop allocating? No idea whether this is feasible.
The channel could be buffered to reduce the amount of waiting by the workers. However, buffering was explicitly disabled - I have no idea why.
Maybe try doing more work per iteration by chunking (e.g. using ChunkSplitters.jl), and/or increase the problem size? In general, it becomes easier to parallelize things as the computation time increases, because then the overhead tends to matter less.
Five microseconds per element overhead is unacceptably slow in my opinion. This is the kind of time it ought to take to spawn an entirely new task per element.
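The chunking idea can also be sketched without extra packages, spawning one task per chunk of indices instead of rendezvousing once per element (crt_chunked! is a hypothetical name, and the default chunk size of 1024 is an arbitrary illustrative choice):

```julia
using Base.Threads

# Amortize per-element scheduling cost: one spawned task per chunk of indices.
function crt_chunked!(x, y, z; chunksize = 1024)
    M = BigInt(2)^3000 + 3
    @sync for chunk in Iterators.partition(eachindex(x, y, z), chunksize)
        Threads.@spawn for i in chunk
            Base.GMP.MPZ.mul!(x[i], y[i], z[i])   # x[i] = y[i] * z[i]
            Base.GMP.MPZ.fdiv_q!(x[i], z[i])      # x[i] = x[i] ÷ z[i]  (back to y[i])
            Base.GMP.MPZ.fdiv_r!(x[i], M)         # x[i] = x[i] mod M
        end
    end
    return nothing
end
```

With this loop body, (y[i] * z[i]) ÷ z[i] == y[i] and y[i] < M, so x ends up equal to y; the point here is only the one-task-per-chunk structure, which trades per-element synchronization for a handful of task spawns.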
I suggest using OhMyThreads.jl which makes it very easy to play around with schedulers, chunking, chunksize etc.
using Base.Threads
import OhMyThreads

function crt(x, y, z)
    M = BigInt(2)^3000 + 3
    Threads.@threads :greedy for i in 1:length(x)
        Base.GMP.MPZ.mul!(x[i], y[i], z[i])
        Base.GMP.MPZ.fdiv_q!(x[i], z[i])
        Base.GMP.MPZ.fdiv_r!(x[i], M)
    end
    return
end

function crt2(x, y, z, s)
    OhMyThreads.@tasks for i in 1:length(x)
        OhMyThreads.@local M = BigInt(2)^3000 + 3
        OhMyThreads.@set scheduler = s
        Base.GMP.MPZ.mul!(x[i], y[i], z[i])
        Base.GMP.MPZ.fdiv_q!(x[i], z[i])
        Base.GMP.MPZ.fdiv_r!(x[i], M)
    end
    return
end
M = 256 * 256
x = [BigInt() for _ in 1:M]
y = [rand(1:(BigInt(2)^3000)) for _ in 1:M]
z = [rand(1:(BigInt(2)^3000)) for _ in 1:M]
crt(x, y, z)
@time crt(x, y, z)
s = OhMyThreads.GreedyScheduler()
crt2(x, y, z, s)
@time crt2(x, y, z, s)
s = OhMyThreads.GreedyScheduler(; split = OhMyThreads.Consecutive(), chunking = true)
crt2(x, y, z, s)
@time crt2(x, y, z, s)
$ julia --threads=1 crt.jl
1.693188 seconds (65.07 k allocations: 1019.406 KiB)
0.130777 seconds (65.62 k allocations: 2.933 MiB)
0.126363 seconds (44 allocations: 2.719 KiB)
$ julia --threads=4 crt.jl
0.123298 seconds (65.12 k allocations: 1022.672 KiB, 20151 lock conflicts)
0.037692 seconds (65.65 k allocations: 1.375 MiB, 4050 lock conflicts)
0.032612 seconds (92 allocations: 8.086 KiB)