Improving Efficiency of Embarrassingly Parallel Problem Using `Threads`

I recently upgraded my workstation and have been benchmarking a code I had written that I consider to be “embarrassingly parallel”, implemented via the Threads module. However, I found that I was getting far greater losses in efficiency at higher thread counts than I expected, and was hoping for some feedback/advice.

I’m running the benchmarks on the following:

  • Ryzen 9 9950X (16 physical cores, 32 virtual threads)
  • Ubuntu 24.04.1 LTS
  • Julia 1.11.2

The code that I’ve written performs Biot-Savart law integration: it calculates the magnetic field generated by Wire segments at a series of locations in 3D space (Nodes). Each Wire has an effect on every Node, so I’ve broken up the calculation such that each thread calculates the effect of a subset of the Wires on every Node, and the results are summed at the end.

I benchmark the problem like this:

N = 1000
Nt = 16       # 16 threads
nodes = rand(N,3) 
wires = Vector{Wire}(undef,N)
for i in 1:N
    wires[i] = Wire(rand(3), rand(3), 1.0, 1.0)    # Custom struct defined in my module 
end 

@benchmark bfield($nodes, $wires, Nt=Nt)       # Function defined in my module

Sample output:

julia> @benchmark bfield($nodes, $wires, Nt=16)
BenchmarkTools.Trial: 7173 samples with 1 evaluation.
 Range (min … max):  546.206 μs …  13.557 ms  ┊ GC (min … max):  0.00% … 94.43%
 Time  (median):     603.195 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   694.505 μs ± 389.712 μs  ┊ GC (mean ± σ):  11.35% ± 15.04%

  ▆█▇▇▆▅▄▁                        ▁▁▁▁▁                         ▂
  ████████▇▆▄▃▁▅▅▁▁▃▅▁▁▁▃▁▁▃▄▅▆██████████▇▇▇█▇▇▇▆▆▇▇▇▆▇▆▆▆▆▇▅▆▆ █
  546 μs        Histogram: log(frequency) by time       1.85 ms <

 Memory estimate: 2.49 MiB, allocs estimate: 808.

The function bfield is defined in this way:

function bfield(nodes::AbstractArray, wires::Vector{Wire}; Nt::Integer=0)

    Ns = length(wires)
    if Nt == 0 
        # Default is to use all available threads
        Nt = Threads.nthreads()
    end

    # Spawn a new task for each thread by splitting up the source array
    tasks = Vector{Task}(undef, Nt)
    for it = 1:Nt 
        @views tasks[it] = Threads.@spawn biotsavart(nodes, wires[threadindices(it, Nt, Ns)])
    end
    
    # Get the result from each calculation and add it to the output array 
    B = zeros(size(nodes))
    for it = 1:Nt 
        B .+= fetch(tasks[it]) 
    end 

    return B
end 

threadindices is defined elsewhere and simply splits the source array into nearly equal-sized chunks, one per thread.
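Roughly, it behaves like this (a simplified sketch, not my exact implementation):

# Hypothetical threadindices: split 1:Ns into Nt nearly equal, contiguous chunks
# and return the it-th chunk.
function threadindices(it, Nt, Ns)
    chunk = cld(Ns, Nt)              # ceiling division gives the chunk size
    lo = (it - 1) * chunk + 1
    hi = min(it * chunk, Ns)
    return lo:hi
end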

I define the following metrics:
speedup = time_for_1_thread / time_for_Nt_threads
efficiency = speedup / Nt
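In code, I compute these with BenchmarkTools along these lines (t1 and tNt are just illustrative names):

using BenchmarkTools, Statistics

t1  = median(@benchmark bfield($nodes, $wires, Nt=1)).time    # single-thread baseline (ns)
tNt = median(@benchmark bfield($nodes, $wires, Nt=$Nt)).time
speedup    = t1 / tNt
efficiency = speedup / Nt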

For a perfectly efficient problem, efficiency = 1.0 for all threads. But of course that can’t happen in the real world…

In theory, I’d like to run my code with problem sizes on the order of N=1e6, but I typically benchmark at lower N values to save time. The results I get for N=(1000,10000,30000):

At lower thread counts, I’m happy with the efficiency: 94% at 8 threads for the larger problem size is really good! However, at 16 threads, the program actually becomes slower!

Is there anything I can do to improve the efficiency of my code given the snippets I’ve provided? Is the drop in efficiency at 16 threads simply because the overhead of spawning that many tasks and summing the results at the end outweighs the diminishing gains in solution time?


There are multiple factors that limit the efficiency when using many threads. For example:

  • the CPU frequency drops as more cores become active (e.g. 5.7 GHz with one core active vs. 4.3 GHz with all cores active)
  • the data needed for many cores might not fit into the cache any longer
  • the bandwidth of the RAM might be limiting the performance with many cores
  • the Julia garbage collector has more work with many threads

The last point can be mitigated by using multi-threading for the GC.
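For example, Julia can be launched with a separate pool of GC threads (the --gcthreads flag is available since Julia 1.10; the exact counts below are just an illustration):

# launch with 16 compute threads and 8 GC threads
julia -t 16 --gcthreads=8

julia> Threads.ngcthreads()    # query the configured number of GC threads
8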


Ah yeah, I should have expected this not to be a simple answer :sweat_smile:

Is there a way to determine these a priori, so that the program can choose the [most likely] efficient number of threads for the user? Or can this only be known after the fact using benchmarks like this, on a per machine basis?

A bit of both. The core frequency scaling is a property of the CPU, but it really just means you should measure efficiency differently (i.e. scale the result by the clock speed at that core count).

Cache and memory pressure are always a bit tricky, but it’s a good idea to think about the problem from a roofline-model perspective (i.e. how much data do you need to move and how much compute do you do on each piece of data).

That said, the general answer is that the fastest approach is usually to write the code, profile to see what the bottlenecks are, and fix them. There isn’t really a shortcut to fast code.
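For a concrete starting point, a quick pass with the built-in profiler (using the names from your post) could look like:

using Profile

@profile bfield(nodes, wires, Nt=16)          # collect samples from one call
Profile.print(format=:flat, sortedby=:count)  # flat view of where time is spent

For a call this short you may want to run it several times in a loop under @profile to collect enough samples.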


Try getting rid of those allocations for sure, e.g. by using SVectors and Polyester.@batch. In my experience with multithreading, you benefit greatly from avoiding allocations (and hence GC).

Also, note that fetch is not type stable. To avoid problems you should annotate it with whatever it returns, e.g. fetch(tasks[it])::Vector{Float64}. I don’t think it slows things down dramatically in this case, but it might allocate some bytes.

To expand on the type-stability of fetch: you could annotate it, but that’s not really important here. The reason it’s unimportant is that you have a function barrier (the broadcast in B .+= ...) that cures the instability, so you’re only paying for Nt dynamic dispatches, i.e. pretty negligible.

To expand on threading overhead: the rule of thumb is that you should be suspicious of work units that take < 1 ms. Yours tend to be > 5 ms from eyeballing your results, so there’s no need to worry or to reach for Polyester.@batch.

On the other hand, memory / cache use is super important!

Since you didn’t show the definition of Wire and biotsavart, I can only guess at what you’re doing. Important things are:

  1. Wire should be a struct (not a mutable struct!) and its first two Vector components (presumably position and direction?) should be SVectors from StaticArrays; see the sketch after this list. Possibly you’re already doing that and converting in the Wire constructor? If not, you can add a constructor that does the conversion, for compatibility with existing code.
  2. You’re using Float64. Do you really need Float64? Float32 may be enough for Node and Wire.
  3. If you want to compute a field with sources in wires and evaluate the field at nodes with position nodes, then you should use blocking.
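To illustrate point 1: a minimal sketch of what such a Wire could look like, with field names and meanings guessed since the definition wasn’t shown:

using StaticArrays

struct Wire
    a::SVector{3,Float64}      # first 3-vector field (position? just a guess)
    b::SVector{3,Float64}      # second 3-vector field (direction or endpoint? just a guess)
    current::Float64           # guessed meaning of the first scalar argument
    radius::Float64            # guessed meaning of the second scalar argument
end

# Converting constructor so existing calls like Wire(rand(3), rand(3), 1.0, 1.0)
# with plain Vectors keep working:
Wire(a::AbstractVector, b::AbstractVector, current::Real, radius::Real) =
    Wire(SVector{3,Float64}(a), SVector{3,Float64}(b), Float64(current), Float64(radius))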

Blocking means the following: you have two vectors A and B and a function f, and you want to compute res = [sum(f(a, b) for b in B) for a in A], i.e.

res = zeros(length(A))
for i in 1:length(A)
  for j in 1:length(B)
    res[i] += f(A[i], B[j])
  end
end

Your CPU fetches A[1], B[1], A[1], B[2], ... B[n], A[2], B[1], .... This is totally fine if your cache remembers B[1] for subsequent accesses, and it sucks if B doesn’t fit into your cache. (which cache? all of them matter! L1, L2, L3!)

The answer is cache-oblivious algorithms / iteration orders (e.g. a Hilbert curve). Unfortunately I haven’t seen a good ready-made library for that in Julia?

(an actual Hilbert curve kinda sucks on its own; one needs an appropriately sized, SIMD-friendly base case)

This is not very complicated; worst case, the Discourse denizens here can write something up.
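For example, a simple blocked loop (much cruder than a proper space-filling curve, but it shows the idea) could look like this, where blocksize is a tunable guess that should keep the active chunk of B resident in cache:

function blocked_accumulate!(res, f, A, B; blocksize=1024)
    for jlo in 1:blocksize:length(B)
        jhi = min(jlo + blocksize - 1, length(B))
        # sweep all of A while this chunk of B is still hot in cache
        for i in eachindex(A, res)
            acc = zero(eltype(res))
            for j in jlo:jhi
                acc += f(A[i], B[j])
            end
            res[i] += acc
        end
    end
    return res
end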

PS. If my guesses about your problem are correct, then it is not “embarrassingly parallel” for large instances, since you need a cache-oblivious iteration order, space-filling curves, and ideally scheduling that makes optimal use of the shared L3 cache if you want to feed all your FLOPS.

It’s not as hard as matrix multiplication, though: for input size N you have N^2 ops here, whereas matmul/GEMM has only N^1.5 ops, i.e. a lower arithmetic density, so you don’t have to work as hard to keep your compute from being starved of memory.


Are there examples for this rule of thumb? I’ve seen (approx values):

@threads : 15 ms (150 allocations)
@batch   :  5 ms (1 allocation)

I.e., part of the @threads overhead is the heap allocations (which I gather are required for better composability when nesting parallel loops).
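A toy comparison showing the kind of difference I mean (the loop body is arbitrary; timings will vary by machine):

using Polyester, BenchmarkTools

function work_threads!(out, x)
    Threads.@threads for i in eachindex(x, out)
        out[i] = sin(x[i])^2
    end
end

function work_batch!(out, x)
    @batch for i in eachindex(x, out)
        out[i] = sin(x[i])^2
    end
end

x = rand(10^6); out = similar(x)
@btime work_threads!($out, $x)   # spawns a task per thread on each call
@btime work_batch!($out, $x)     # reuses Polyester's task pool, essentially allocation-free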