Understanding allocations in multithreaded code

Hi,
I’m working on speeding up a particle simulation using multithreading, and I’ve come across odd allocations in the multithreaded code that I don’t fully understand, much less know how to get rid of.

Here’s a much-reduced MWE.
There’s a main loop that does the time steps, and at each time step forces need to be computed for all cells (plus interactions, which I neglect here). The force computation does quite a bit of 2D SVector juggling but does not allocate.
However, as soon as I add more threads, the allocations increase.

using StaticArrays

mutable struct Particle
    force::SVector{2, Float64}
end

function forces!(particle)
    # Some bogus computations here
    a = SA[1.0, 1.0]
    A = SA[1.0 0.0; 0.0 1.0]
    for i = 1:1000
        a = A*(a-particle.force)
    end
    particle.force = a
    return nothing
end


function computeforces!(particles, N)
    # update the force of every particle, distributing the work across threads
    Threads.@threads for n = 1:N
        forces!(particles[n])
    end
end


function main(tmax = 1000)
    particles = [Particle(rand(SVector{2, Float64})) for i = 1:100]
    # time-stepping loop: recompute all forces at every step
    for n = 1:tmax
        N = length(particles)
        computeforces!(particles, N)
    end
end

using BenchmarkTools

@btime main(1000)

$ export JULIA_NUM_THREADS=1
$ julia mwe_threading_allocs.jl
  432.846 ms (8101 allocations: 785.25 KiB)
$ export JULIA_NUM_THREADS=2
$ julia mwe_threading_allocs.jl
  226.731 ms (13196 allocations: 1.45 MiB)
$ export JULIA_NUM_THREADS=4
$ julia mwe_threading_allocs.jl
  133.044 ms (23504 allocations: 2.85 MiB)

In this simple example the speedup is still quite good, but in our real calculations with more threads, GC time becomes significant, exceeding 20% of the run time.

Has anyone else run into this problem, and is there any way around it?


The problem might be that Particle is mutable, so it can’t be stored inline inside an array. Does it really need to be mutable or can’t you just pass a 0-dimensional view of particles to forces!?
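For what it’s worth, a quick way to see what “stored inline” means here is to check isbitstype; a small sketch with hypothetical type names (not the ones from the MWE):

using StaticArrays

# Hypothetical types, only to illustrate the inline-storage point
mutable struct ParticleMut
    force::SVector{2, Float64}
end

struct ParticleIso
    force::SVector{2, Float64}
end

isbitstype(ParticleMut)  # false -> Vector{ParticleMut} holds pointers to heap-allocated objects
isbitstype(ParticleIso)  # true  -> Vector{ParticleIso} stores the structs inline in the array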

I don’t think so. Besides the fact that this is not really an option in the real use case, here’s my attempt at doing it:

struct Particle2
    force::SVector{2, Float64}
end

function forces!(particle)
    # `particle` is now a 0-dimensional view into the particle vector
    oldforce = particle[].force
    a = SA[1.0, 1.0]
    A = SA[1.0 0.0; 0.0 1.0]
    for i = 1:1000
        a = A*(a-oldforce)
    end
    # write the updated, immutable particle back through the view
    particle[] = Particle2(a)
    return nothing
end


function computeforces!(particles, N)
    Threads.@threads for n = 1:N
        forces!(view(particles,n))
    end
end

with the following results:

$ export JULIA_NUM_THREADS=1
$ julia mwe_threading_allocs.jl 
  431.653 ms (8001 allocations: 783.02 KiB)
$ export JULIA_NUM_THREADS=2
$ julia mwe_threading_allocs.jl 
  230.508 ms (13304 allocations: 1.46 MiB)
$ export JULIA_NUM_THREADS=4
$ julia mwe_threading_allocs.jl
  135.033 ms (23351 allocations: 2.85 MiB)

The memory allocations are in the system libraries, and you don’t really need to worry about them. If you run:

JULIA_NUM_THREADS=1 julia --track-allocation=user mwe_threading_allocs.jl

You get a report showing where the allocations happen:

        - function main(tmax = 1000)
      128     particles = [Particle(rand(SVector{2, Float64})) for i=1:100]
        0     for n = 1:tmax
        0         N = length(particles)
   192000         computeforces!(particles, N)
        -     end
        - end
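For completeness: those per-line numbers come from the *.mem files Julia writes next to the source file when it exits under --track-allocation. A minimal sketch of the usual workflow (assuming the same main as in the MWE), so that compilation allocations don’t pollute the counts:

using Profile

main(1)                      # run once so compilation happens here
Profile.clear_malloc_data()  # reset the per-line allocation counters
main(1000)                   # only this run shows up in the *.mem report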

When you increase the number of threads you get essentially the same report:

JULIA_NUM_THREADS=4 julia --track-allocation=user mwe_threading_allocs.jl

Results in:

        - function main(tmax = 1000)
      128     particles = [Particle(rand(SVector{2, Float64})) for i=1:100]
        0     for n = 1:tmax
        0         N = length(particles)
   191936         computeforces!(particles, N)
        -     end
        - end

There are no other allocations done by your code. The increase in allocations comes from the synchronization between threads, which is done by the system library. Although I’m kind of curious why the 4-thread run allocated 64 bytes less… it might just be an edge case.
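To make the synchronization cost concrete, here is a rough sketch that counts the allocations of the @threads machinery alone (empty_threads is a hypothetical helper, not part of the MWE): the loop body is empty, so everything it allocates comes from creating and joining the per-thread tasks, and the count should grow with the number of threads.

function empty_threads(n)
    for _ in 1:n
        Threads.@threads for i in 1:Threads.nthreads()
            # no work at all
        end
    end
end

empty_threads(1)                      # compile first
@show @allocated empty_threads(1000)  # pure @threads overhead for 1000 calls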


Yes, that seems to be what is happening in this reduced example, even though the total number and size of the allocations still grow with the number of threads.

In my real code, using julia --track-allocation=user, I start seeing lots of allocations in my force! function that are zero when running with only a single thread.

And just not worrying about it is not really a satisfactory answer when, in the real code, GC starts taking more than 20% of the execution time.
