Memory allocations and performance with multithreading

I’m trying to make use of multithreading to parallelize an inner loop, but while it works (the answer is correct and I can see the CPU being used), there is a serious performance hit:

# Cheap stand-in for what is, in the real problem, a much more expensive update
function update(x, Δt)
    return x + 0.5 * Δt * x
end


# Serial reference version: broadcasts the update over the whole array at each time step
function compute_nothreaded(x0_vals, Δt, nΔt)
        x_vals = copy(x0_vals);
        
        nx = length(x0_vals);
        
        for n in 1:nΔt
            @. x_vals = update(x_vals, Δt);
        end
        return x_vals
end


# Threaded version: Threads.@threads over the array entries at every time step
function compute_threaded(x0_vals, Δt, nΔt)
        x_vals = copy(x0_vals);
        
        nx = length(x0_vals);
        
        for n in 1:nΔt
            Threads.@threads for j in 1:nx
                x_vals[j] = update(x_vals[j], Δt)
            end
        end
        return x_vals
end

using Random, BenchmarkTools

Random.seed!(100);
x0 = randn(10);
Δt = 0.5;
nΔt = 10^2;

@btime compute_threaded($x0, $Δt, $nΔt)
@btime compute_threaded($x0, $Δt, $nΔt)
@btime compute_threaded($x0, $Δt, $nΔt)

  4.299 ms (4230 allocations: 319.81 KiB)
  4.178 ms (4224 allocations: 319.62 KiB)
  4.309 ms (4227 allocations: 319.72 KiB)

@btime compute_nothreaded($x0, $Δt, $nΔt)
@btime compute_nothreaded($x0, $Δt, $nΔt)
@btime compute_nothreaded($x0, $Δt, $nΔt)

  928.654 ns (1 allocation: 160 bytes)
  882.420 ns (1 allocation: 160 bytes)
  931.420 ns (1 allocation: 160 bytes)

This is the prototype for a problem with a much more expensive update function.

The actual update function is important for this decision, but you could try FLoops.jl or @avxt from LoopVectorization.jl as alternatives.
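
For concreteness, a minimal sketch of what an FLoops.jl version of the inner loop could look like (the ThreadedEx() executor and the name compute_floop are my additions, not from the original post):

using FLoops

function compute_floop(x0_vals, Δt, nΔt)
    x_vals = copy(x0_vals)
    nx = length(x0_vals)
    for n in 1:nΔt
        # parallel over array entries at each time step, like Threads.@threads
        @floop ThreadedEx() for j in 1:nx
            x_vals[j] = update(x_vals[j], Δt)
        end
    end
    return x_vals
end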

Starting threads has its own overhead. In this implementation you launch 100*Threads.nthreads() threaded tasks, and each one does only a couple of calculations. You should put the threading on an outer loop.

Also, you can’t observe the benefits of multithreading on small data, for the same reason (the overhead dominates). Try increasing nx and nΔt.
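
For example (the array size here is my own choice, purely to illustrate the point):

x0_big = randn(10^6);   # with ~10^6 elements, each inner loop does far more work per task

@btime compute_nothreaded($x0_big, $Δt, $nΔt);
@btime compute_threaded($x0_big, $Δt, $nΔt);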

More importantly, it looks like you’re trying to mutate x_vals from multiple threads. Since the indexing is unique, maybe this isn’t thread-unsafe code (though it may still be), but if I rewrite your function to just use a parallel map and crank up the size of x0, it benefits from threading the way you would expect. So I think this pattern of looping over something with multiple threads and mutating at individual indices just isn’t a good pattern for performance. Although I bet somebody like @tkf would have a much better/more informative answer…

EDIT: I made a mistake in my quick benchmark and take back the comment that updating a buffer from multiple threads is a bad pattern. It seems to be very competitive in this setting.
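
(A sketch of what such a parallel-map rewrite could look like; ThreadsX.jl and the function name are my choices here, not the code used in the quick benchmark above:)

using ThreadsX

function compute_parallel_map(x0_vals, Δt, nΔt)
    x_vals = copy(x0_vals)
    for n in 1:nΔt
        # allocates a fresh array each step instead of mutating in place
        x_vals = ThreadsX.map(x -> update(x, Δt), x_vals)
    end
    return x_vals
end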

The amount of computation is too small for the current task scheduler to improve the performance by parallelization. You need larger nx so that one for j loop takes at least (say) 10 μs.

x_vals at step n depends on step n-1, so there is no trivial way to parallelize this at the outer loop. I think it’s possible to create a spin-based barrier to handle the synchronization at each for n loop. But although it may improve latency, it can hurt throughput if there are other parallel tasks in your Julia process.

Yes, it’s actually fine, considering that every task touches a different j and x_vals is an Array.

(Side note: both of these conditions are important. For example, xs[j] = true can still be a data race if xs isa BitVector, since xs[j] and xs[j+1] can touch the same memory location.)
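
(To make that side note concrete, a small sketch of the unsafe BitVector pattern; the size is arbitrary:)

xs = falses(64)               # BitVector: all 64 entries are packed into a single UInt64 chunk
Threads.@threads for j in eachindex(xs)
    xs[j] = true              # read-modify-write of the shared chunk: a data race, writes can be lost
end
# With xs = fill(false, 64) (a Vector{Bool}) every element has its own byte,
# and the same loop is safe.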


x_vals for n depends on n-1. So, there is no trivial way to parallelize this at the outer loop.

In this particular example it is possible to swap the order of the inner and outer loops, since there is no interaction between x_vals[i] and x_vals[j]:

function compute_threaded(x0_vals, Δt, nΔt)
        x_vals = copy(x0_vals);
        
        nx = length(x0_vals);
        
        Threads.@threads for j in 1:nx
            for n in 1:nΔt
                x_vals[j] = update(x_vals[j], Δt)
            end
        end
        return x_vals
end

but this is probably not what the author wants, since we can assume that in a more complicated MWE there are dependencies between different elements. In that case one can use something like Nbabel nbody integrator speed up - #30 by Skoffer, but as has been said a few times already, such approaches are useful only for rather large vector sizes.


My bad. You are right. The loops can be flipped in this case.

Yes, in the full version I would like to be able to do some reductions at each value of n, requiring computation across the array. Regarding the comments about the "size of the vector": would I, hypothetically, get a speedup if the array remained relatively small but the update function were much more costly (closer to my actual case)?

I know that I could also do this using Distributed, but I’m a bit reluctant to switch to that because I have some custom data types in my actual problem that I don’t think play nicely with SharedArrays, though that can be further investigated.
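
(As an aside, if a per-step reduction is what you are after, here is one hedged sketch using FLoops.jl’s @reduce to accumulate a sum of the updated values at every n; the whole function is my construction, not something from this thread:)

using FLoops

function compute_threaded_with_sums(x0_vals, Δt, nΔt)
    x_vals = copy(x0_vals)
    sums = zeros(nΔt)                          # one reduction result per time step
    for n in 1:nΔt
        @floop ThreadedEx() for j in eachindex(x_vals)
            x_vals[j] = update(x_vals[j], Δt)
            @reduce(s = 0.0 + x_vals[j])       # thread-safe sum across the array, restarted from 0.0 each step
        end
        sums[n] = s
    end
    return x_vals, sums
end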

As tkf said, the rule of thumb is something like 10 μs (a very inexact estimate, of course) for one inner loop, so you may get some acceleration if a single update call runs close to 1 μs, maybe more. But if it only takes that long because it is inefficient and generates lots of allocations, then it will be hit by GC pressure, and it is better to have an optimized version anyway :slight_smile:

Distributed is of no help here, because its overhead is even higher: it starts separate processes, possibly even on different servers, so it is intended for cases where the calculation time is much, much higher than the inter-process communication time.