Understanding allocations in multithreaded code

Hi,
I’m working on speeding up a particle simulation using multithreading, and I’ve come across odd allocations in the multithreaded code that I don’t fully understand, much less know how to get rid of.

Here’s a much-reduced MWE.
There’s a main loop that does the time steps, and at each time step forces need to be computed for all cells (plus interactions, which I neglect here). The force computation does quite a bit of 2D SVector juggling but does not allocate.
However, as soon as I add more threads, the allocations increase.

using StaticArrays

mutable struct Particle
    force::SVector{2, Float64}
end

function forces!(particle)
    # Some bogus computations here
    a = SA[1.0, 1.0]
    A = SA[1.0 0.0; 0.0 1.0]
    for i = 1:1000
        a = A*(a-particle.force)
    end
    particle.force = a
    return nothing
end


function computeforces!(particles, N)
    # update the force of every particle, distributing the work across threads
    Threads.@threads for n = 1:N
        forces!(particles[n])
    end
end


function main(tmax = 1000)
    particles = [Particle(rand(SVector{2, Float64})) for i = 1:100]
    # time-stepping loop: recompute all forces at every step
    for n = 1:tmax
        N = length(particles)
        computeforces!(particles, N)
    end
end

using BenchmarkTools

@btime main(1000)

$ export JULIA_NUM_THREADS=1
$ julia mwe_threading_allocs.jl
  432.846 ms (8101 allocations: 785.25 KiB)
$ export JULIA_NUM_THREADS=2
$ julia mwe_threading_allocs.jl
  226.731 ms (13196 allocations: 1.45 MiB)
$ export JULIA_NUM_THREADS=4
$ julia mwe_threading_allocs.jl
  133.044 ms (23504 allocations: 2.85 MiB)

In this simple example the speedup is still quite good, but in our real calculations with more threads, GC time becomes significant, exceeding 20% of the run time.

Has anyone else run into this problem, and is there any way around it?


The problem might be that Particle is mutable, so it can’t be stored inline inside an array. Does it really need to be mutable or can’t you just pass a 0-dimensional view of particles to forces!?
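For what it’s worth, a quick way to see what “stored inline” means here is to check isbitstype; a small sketch with hypothetical type names (not the ones from the MWE):

using StaticArrays

# Hypothetical types, only to illustrate the inline-storage point
mutable struct ParticleMut
    force::SVector{2, Float64}
end

struct ParticleIso
    force::SVector{2, Float64}
end

isbitstype(ParticleMut)  # false -> Vector{ParticleMut} holds pointers to heap-allocated objects
isbitstype(ParticleIso)  # true  -> Vector{ParticleIso} stores the structs inline in the array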

I don’t think so. Besides the fact that this is not really an option in the real use case, here’s my attempt at doing it:

struct Particle2
    force::SVector{2, Float64}
end

function forces!(particle)
    # `particle` is now a 0-dimensional view into the particle vector
    oldforce = particle[].force
    a = SA[1.0, 1.0]
    A = SA[1.0 0.0; 0.0 1.0]
    for i = 1:1000
        a = A*(a-oldforce)
    end
    # write the updated, immutable particle back through the view
    particle[] = Particle2(a)
    return nothing
end


function computeforces!(particles, N)
    Threads.@threads for n = 1:N
        forces!(view(particles,n))
    end
end

with the following results:

$ export JULIA_NUM_THREADS=1
$ julia mwe_threading_allocs.jl 
  431.653 ms (8001 allocations: 783.02 KiB)
$ export JULIA_NUM_THREADS=2
$ julia mwe_threading_allocs.jl 
  230.508 ms (13304 allocations: 1.46 MiB)
$ export JULIA_NUM_THREADS=4
$ julia mwe_threading_allocs.jl
  135.033 ms (23351 allocations: 2.85 MiB)

The memory allocations are in the system libraries, and you don’t really need to worry about them. If you run:

JULIA_NUM_THREADS=1 julia --track-allocation=user mwe_threading_allocs.jl

You get a report showing where the allocations happen:

        - function main(tmax = 1000)
      128     particles = [Particle(rand(SVector{2, Float64})) for i=1:100]
        0     for n = 1:tmax
        0         N = length(particles)
   192000         computeforces!(particles, N)
        -     end
        - end
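For completeness: those per-line numbers come from the *.mem files Julia writes next to the source file when it exits under --track-allocation. A minimal sketch of the usual workflow (assuming the same main as in the MWE), so that compilation allocations don’t pollute the counts:

using Profile

main(1)                      # run once so compilation happens here
Profile.clear_malloc_data()  # reset the per-line allocation counters
main(1000)                   # only this run shows up in the *.mem report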

When you increase the number of threads you get essentially the same report:

JULIA_NUM_THREADS=4 julia --track-allocation=user mwe_threading_allocs.jl

Results in:

        - function main(tmax = 1000)
      128     particles = [Particle(rand(SVector{2, Float64})) for i=1:100]
        0     for n = 1:tmax
        0         N = length(particles)
   191936         computeforces!(particles, N)
        -     end
        - end

There are no other allocations done by your code. The increase in allocations comes from the synchronization between threads, which is done by the system library. Although I’m kind of curious why the 4-thread run allocated 64 bytes less… it might just be an edge case.
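To make the synchronization cost concrete, here is a rough sketch that counts the allocations of the @threads machinery alone (empty_threads is a hypothetical helper, not part of the MWE): the loop body is empty, so everything it allocates comes from creating and joining the per-thread tasks, and the count should grow with the number of threads.

function empty_threads(n)
    for _ in 1:n
        Threads.@threads for i in 1:Threads.nthreads()
            # no work at all
        end
    end
end

empty_threads(1)                      # compile first
@show @allocated empty_threads(1000)  # pure @threads overhead for 1000 calls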


Yes, that seems to be what is happening in this reduced example, even though the total number and size of the allocations still grow with the number of threads.

In my real code, using julia --track-allocation=user, I start seeing lots of allocations in my force! function that are zero when running with only a single thread.

And just not worrying about it is not really a satisfactory answer when, in the real code, GC starts taking more than 20% of the execution time.
