I recently upgraded my workstation and have been benchmarking some code I'd written that I consider to be "embarrassingly parallel", implemented via the `Threads` module. However, I found that I was getting far greater losses in efficiency at higher thread counts than I expected, and was hoping for some feedback/advice.
I’m running the benchmarks on the following:
- Ryzen 9 9950X (16 physical cores, 32 virtual threads)
- Ubuntu 24.04.1 LTS
- Julia 1.11.2
The code that I've written performs Biot-Savart law integration: it calculates the magnetic field generated by `Wire` segments at a series of locations in 3D space (`Node`s). Each `Wire` has an effect on every `Node`, so what I've done is break up the calculation such that each thread calculates the effect of a subset of the `Wire`s on every node, and the results are summed at the end.
I benchmark the problem like this:

```julia
N = 1000
Nt = 16 # 16 threads
nodes = rand(N, 3)
wires = Vector{Wire}(undef, N)
for i in 1:N
    wires[i] = Wire(rand(3), rand(3), 1.0, 1.0) # Custom struct defined in my module
end
@benchmark bfield($nodes, $wires, Nt=Nt) # Function defined in my module
```
Sample output:

```text
julia> @benchmark bfield($nodes, $wires, Nt=16)
BenchmarkTools.Trial: 7173 samples with 1 evaluation.
 Range (min … max):  546.206 μs … 13.557 ms  ┊ GC (min … max):  0.00% … 94.43%
 Time  (median):     603.195 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   694.505 μs ± 389.712 μs ┊ GC (mean ± σ):  11.35% ± 15.04%

  ▆█▇▇▆▅▄▁          ▁▁▁▁▁                                      ▂
  ████████▇▆▄▃▁▅▅▁▁▃▅▁▁▁▃▁▁▃▄▅▆██████████▇▇▇█▇▇▇▆▆▇▇▇▆▇▆▆▆▆▇▅▆▆ █
  546 μs        Histogram: log(frequency) by time       1.85 ms <

 Memory estimate: 2.49 MiB, allocs estimate: 808.
```
The function `bfield` is defined this way:
```julia
function bfield(nodes::AbstractArray, wires::Vector{Wire}; Nt::Integer=0)
    Ns = length(wires)
    if Nt == 0
        # Default is to use all available threads
        Nt = Threads.nthreads()
    end
    # Spawn a new task for each thread by splitting up the source array
    tasks = Vector{Task}(undef, Nt)
    for it = 1:Nt
        @views tasks[it] = Threads.@spawn biotsavart(nodes, wires[threadindices(it, Nt, Ns)])
    end
    # Get the result from each calculation and add it to the output array
    B = zeros(size(nodes))
    for it = 1:Nt
        B .+= fetch(tasks[it])
    end
    return B
end
```
`threadindices` is defined elsewhere and simply splits the problem into nearly equal-sized chunks, one per thread.
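The post doesn't show `threadindices`, but a typical implementation of "nearly equal-sized contiguous chunks" looks like this (an assumption on my part, not the author's actual code):

```julia
# Split 1:Ns into Nt contiguous chunks whose sizes differ by at most one.
# Returns the index range handled by thread `it` (1-based).
function threadindices(it::Integer, Nt::Integer, Ns::Integer)
    q, r = divrem(Ns, Nt)                 # base chunk size and remainder
    lo = (it - 1) * q + min(it - 1, r) + 1
    hi = it * q + min(it, r)              # first r chunks get one extra item
    return lo:hi
end
```

For example, `threadindices.(1:3, 3, 10)` yields `1:4`, `5:7`, and `8:10`, which together cover `1:10` with no overlap.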
I define the following metrics (speedup is single-thread time over multi-thread time):

```
speedup = time_for_1_thread / time_for_Nt_threads
efficiency = speedup / Nt
```
For a perfectly efficient problem, `efficiency = 1.0` at every thread count. But of course that can't happen in the real world…
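To make the metrics concrete, here is the calculation with hypothetical timings (illustrative numbers only, not my measured data):

```julia
# Hypothetical median times in seconds for a fixed problem size.
t1  = 0.1      # 1 thread
t16 = t1 / 8   # 16 threads, but only 8x faster

speedup    = t1 / t16       # 8.0
efficiency = speedup / 16   # 0.5, i.e. 50% parallel efficiency
```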
In theory, I'd like to run my code with problem sizes on the order of `N = 1e6`, but I typically benchmark at lower `N` values to save time. The results I get for `N = (1000, 10000, 30000)`:
At lower thread counts, I'm happy with the efficiency: 94% at 8 threads for the largest problem size is really good! However, at 16 threads the program actually becomes slower!
Is there anything I can do to improve the efficiency of my code given the snippets I've provided? Is the drop in efficiency at 16 threads solely because the overhead of spawning so many tasks and summing the results at the end outweighs the minimal gains in solution time?