Strange slowdown with @threads :greedy with BigInts

Hi,

I have a function that does in-place operations on BigInts (so there are almost no Julia-side allocations in the @time report, see below). With --threads=8 it is twice as slow as with --threads=1. Note: it seems fine without :greedy.

  • Does anyone know why the slowdown occurs in this example?
  • In the future, do you have tips on how I can debug such examples by myself?

Thanks.

# crt.jl
using Base.Threads

function crt(x, y, z)
    M = BigInt(2)^3000 + 3
    Threads.@threads :greedy for i in 1:length(x)
        Base.GMP.MPZ.mul!(x[i], y[i], z[i])    # x[i] = y[i] * z[i]   (in place)
        Base.GMP.MPZ.fdiv_q!(x[i], z[i])       # x[i] = fld(x[i], z[i])
        Base.GMP.MPZ.fdiv_r!(x[i], M)          # x[i] = mod(x[i], M)
    end
end

M = 256*256
x = [BigInt() for _ in 1:M]
y = [rand(1:BigInt(2)^3000) for _ in 1:M]
z = [rand(1:BigInt(2)^3000) for _ in 1:M]
@time crt(x, y, z)
@time crt(x, y, z)

Running:

$ julia --threads=1 crt.jl 
  0.662514 seconds (334.77 k allocations: 58.987 MiB, 1.64% gc time, 34.27% compilation time)
  0.492946 seconds (65.06 k allocations: 1018.570 KiB)

$ julia --threads=8 crt.jl 
  1.180754 seconds (335.61 k allocations: 59.028 MiB, 4.08% gc time, 11989 lock conflicts, 81.44% compilation time)
  0.941041 seconds (65.18 k allocations: 1.002 MiB, 11097 lock conflicts)

Running on:

julia> versioninfo()
Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake-avx512)
Threads: 1 default, 0 interactive, 1 GC (on 24 virtual cores)
Environment:
  LD_LIBRARY_PATH = /sw/lib:

My system is different, but I get a ~4x speedup, even with 30k+ lock conflicts. I'm not really sure what the lock conflicts are; the iterations write to different elements, and moving the M instance into the loop doesn't make a significant difference. (Minor catch: BigInt(2^3000+3) is the same as BigInt(0+3), because 2^3000 overflows Int to 0; you may have meant BigInt(2)^3000 + 3.)
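For reference, Int powers wrap around modulo 2^64 on a 64-bit system, and 2^64 divides 2^3000:

julia> 2^3000
0

julia> BigInt(2^3000 + 3)
3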

1 Like

Yes, thank you. I edited the original post. (This change does not noticeably affect the timings for me.)

On another machine I get:

julia --threads=1 crt.jl 
  0.300064 seconds (334.99 k allocations: 58.966 MiB, 1.34% gc time, 40.95% compilation time)
  0.159975 seconds (65.07 k allocations: 1019.414 KiB)

julia --threads=8 crt.jl 
  0.275606 seconds (335.34 k allocations: 58.984 MiB, 1.70% gc time, 16055 lock conflicts, 217.44% compilation time)
  0.150733 seconds (65.18 k allocations: 1.003 MiB, 13523 lock conflicts)

Machine:

Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

@threads :greedy works by spawning one task that puts all the elements into a zero-capacity channel, while also spawning Threads.threadpoolsize() tasks that take from that channel and run the loop body.
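Roughly, the mechanism is equivalent to this sketch (my simplification; the actual implementation in Base differs in its details):

using Base.Threads

# Simplified model of @threads :greedy, not the actual Base implementation
function greedy_foreach(f, xs)
    ch = Channel{eltype(xs)}(0)          # zero capacity: every put!/take! must rendezvous
    @sync begin
        Threads.@spawn begin             # one producer task feeds the channel...
            for x in xs
                put!(ch, x)
            end
            close(ch)
        end
        for _ in 1:Threads.threadpoolsize()
            Threads.@spawn for x in ch   # ...while worker tasks race to take items
                f(x)
            end
        end
    end
    return
end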
There is therefore overhead associated with:

  • Creating the 9 tasks (one producer plus eight workers) and destroying them again
  • Lock contention, as the 8 workers continuously take from the same channel and are continuously woken up
  • There also seems to be one allocation per element, which appears to originate in the scheduler itself, invoked whenever a worker thread calls take! on the channel. Allocations interact poorly with multiple threads in Julia, since they trigger frequent GC runs, which must be coordinated between the threads (adding further overhead)

Your code takes less than 5 microseconds per iteration, so the overhead swamps the computation time.
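(For scale: the second --threads=1 run on the i9-13900 machine above processes 256*256 = 65,536 elements in about 0.16 seconds, i.e. roughly 2.4 microseconds per element.)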

What could be done about this?

  • Use a different scheduler. The greedy scheduler is useful when you have longer-running tasks (milliseconds or longer), where the runtime per element varies.
  • Maybe the scheduler itself could stop allocating? I have no idea whether this is feasible.
  • The channel could be buffered to reduce the amount of waiting by the workers. However, buffering was explicitly disabled; I have no idea why. (See the note below.)
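In the sketch above, buffering would amount to replacing Channel{eltype(xs)}(0) with something like Channel{eltype(xs)}(2 * Threads.threadpoolsize()), so the producer can run ahead of the workers instead of rendezvousing with one of them on every single element.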
1 Like

Maybe try doing more work per iteration by chunking (e.g. using ChunkSplitters.jl), and/or increasing the problem size? In general, it becomes easier to parallelize things as the computation time grows, because the overhead then tends to matter less.
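For illustration, a chunked variant could look something like this (a sketch assuming ChunkSplitters' index_chunks; the name crt_chunked is mine):

using Base.Threads
using ChunkSplitters: index_chunks

# One task per chunk instead of one channel hand-off per element
function crt_chunked(x, y, z; nchunks = Threads.nthreads())
    M = BigInt(2)^3000 + 3       # only read by the workers, so sharing it is fine
    @sync for idxs in index_chunks(x; n = nchunks)
        Threads.@spawn for i in idxs
            Base.GMP.MPZ.mul!(x[i], y[i], z[i])
            Base.GMP.MPZ.fdiv_q!(x[i], z[i])
            Base.GMP.MPZ.fdiv_r!(x[i], M)
        end
    end
    return
end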

1 Like

IMO the overhead is unnecessarily large, so I made an issue and a PR: Reduce `@threads :greedy` overhead by buffering channel · Issue #59514 · JuliaLang/julia · GitHub

Five microseconds of overhead per element is unacceptably slow in my opinion. That is the kind of time it ought to take to spawn an entirely new task per element.

3 Likes

I suggest using OhMyThreads.jl, which makes it very easy to play around with schedulers, chunking, chunk sizes, etc.

using Base.Threads

import OhMyThreads

function crt(x, y, z)
    M = BigInt(2)^3000 + 3
    Threads.@threads :greedy for i in 1:length(x)
        Base.GMP.MPZ.mul!(x[i], y[i], z[i])
        Base.GMP.MPZ.fdiv_q!(x[i], z[i])
        Base.GMP.MPZ.fdiv_r!(x[i], M)
    end
    return
end

function crt2(x, y, z, s)
    OhMyThreads.@tasks for i in 1:length(x)
        OhMyThreads.@local M = BigInt(2)^3000 + 3   # task-local: initialized once per task, not per iteration
        OhMyThreads.@set scheduler = s              # pick the scheduler at runtime
        Base.GMP.MPZ.mul!(x[i], y[i], z[i])
        Base.GMP.MPZ.fdiv_q!(x[i], z[i])
        Base.GMP.MPZ.fdiv_r!(x[i], M)
    end
    return
end

M = 256 * 256
x = [BigInt() for _ in 1:M]
y = [rand(1:(BigInt(2)^3000)) for _ in 1:M]
z = [rand(1:(BigInt(2)^3000)) for _ in 1:M]

crt(x, y, z)
@time crt(x, y, z)

s = OhMyThreads.GreedyScheduler()
crt2(x, y, z, s)
@time crt2(x, y, z, s)

s = OhMyThreads.GreedyScheduler(; split = OhMyThreads.Consecutive(), chunking = true)
crt2(x, y, z, s)
@time crt2(x, y, z, s)

$ julia --threads=1 crt.jl
  1.693188 seconds (65.07 k allocations: 1019.406 KiB)
  0.130777 seconds (65.62 k allocations: 2.933 MiB)
  0.126363 seconds (44 allocations: 2.719 KiB)

$ julia --threads=4 crt.jl
  0.123298 seconds (65.12 k allocations: 1022.672 KiB, 20151 lock conflicts)
  0.037692 seconds (65.65 k allocations: 1.375 MiB, 4050 lock conflicts)
  0.032612 seconds (92 allocations: 8.086 KiB)
2 Likes

Thank you for the explanations and the suggestions! Now I understand the issue.