# Understanding allocations in multithreaded code

Hi,
I’m working on speeding up a particle simulation using multi-threading, and I’ve come across odd allocations in the multi-threaded code that I don’t fully understand, much less know how to get rid of.

Here’s a much-reduced MWE: a main loop runs the time steps, and at each time step forces need to be computed for all cells (plus interactions, which I neglect here). The force computation does quite a bit of 2D `SVector` juggling but does not allocate.

```julia
using StaticArrays

mutable struct Particle
    force::SVector{2, Float64}
end

function forces!(particle)
    # Some bogus computations here
    a = SA[1.0, 1.0]
    A = SA[1.0 0.0; 0.0 1.0]
    for i = 1:1000
        a = A * (a - particle.force)
    end
    particle.force = a
    return nothing
end

function computeforces!(particles, N)
    Threads.@threads for n = 1:N
        forces!(particles[n])
    end
end

function main(tmax = 1000)
    particles = [Particle(rand(SVector{2, Float64})) for i = 1:100]
    for n = 1:tmax
        N = length(particles)
        computeforces!(particles, N)
    end
end

using BenchmarkTools

@btime main(1000)
```
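As a standalone sanity check (a minimal sketch, not from the original post; the `Particle` and `forces!` definitions are repeated from the MWE above so the snippet runs on its own), `@allocated` confirms that the force kernel itself is allocation-free:

```julia
using StaticArrays

# Definitions repeated from the MWE above so this check runs standalone.
mutable struct Particle
    force::SVector{2, Float64}
end

function forces!(particle)
    a = SA[1.0, 1.0]
    A = SA[1.0 0.0; 0.0 1.0]
    for i = 1:1000
        a = A * (a - particle.force)
    end
    particle.force = a
    return nothing
end

p = Particle(SA[0.5, 0.5])
forces!(p)                           # run once to compile
@assert @allocated(forces!(p)) == 0  # the kernel itself is allocation-free
```

All the `SVector` arithmetic stays on the stack, and writing an `isbits` field of a mutable struct does not allocate, so any allocations seen in the benchmark must come from elsewhere.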
```
$ export JULIA_NUM_THREADS=1
432.846 ms (8101 allocations: 785.25 KiB)
$ export JULIA_NUM_THREADS=2
226.731 ms (13196 allocations: 1.45 MiB)
$ export JULIA_NUM_THREADS=4
133.044 ms (23504 allocations: 2.85 MiB)
```

In this simple example the speedup is still quite good, but in our real calculations with more threads, GC time becomes significant, exceeding 20%.

Has anyone else had that problem and is there any way around this?


The problem might be that `Particle` is mutable, so it can’t be stored inline inside an array. Does it really need to be mutable, or can’t you just pass a 0-dimensional `view` of `particles` to `forces!`?
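For context (a minimal illustration, not from the original post; the struct names here are made up): whether a `Vector`'s elements are stored inline can be checked with `isbitstype`:

```julia
using StaticArrays

# A mutable struct is a heap-allocated object; a Vector of them stores
# pointers, one heap object per element.
mutable struct MutableParticle
    force::SVector{2, Float64}
end

# An immutable struct whose fields are all isbits is itself isbits and
# is stored inline in a Vector, with no per-element heap object.
struct ImmutableParticle
    force::SVector{2, Float64}
end

@assert !isbitstype(MutableParticle)
@assert isbitstype(ImmutableParticle)
```

With the immutable version, updating a particle means writing a new `ImmutableParticle` into the array slot, e.g. through a 0-dimensional `view` as suggested above.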

I don’t think so. Besides the fact that this is not really an option in the real use case, here’s my attempt at doing that:

```julia
struct Particle2
    force::SVector{2, Float64}
end

function forces!(particle)
    oldforce = particle[].force
    a = SA[1.0, 1.0]
    A = SA[1.0 0.0; 0.0 1.0]
    for i = 1:1000
        a = A * (a - oldforce)
    end
    particle[] = Particle2(a)
    return nothing
end

function computeforces!(particles, N)
    Threads.@threads for n = 1:N
        forces!(view(particles, n))
    end
end
```

with the following results

```
$ export JULIA_NUM_THREADS=1
431.653 ms (8001 allocations: 783.02 KiB)
$ export JULIA_NUM_THREADS=2
230.508 ms (13304 allocations: 1.46 MiB)
$ export JULIA_NUM_THREADS=4
135.033 ms (23351 allocations: 2.85 MiB)
```

The memory allocations happen in the runtime, so you don’t really need to worry about them. If you run:

```
JULIA_NUM_THREADS=1 julia --track-allocation=user mwe_threading_allocs.jl
```

You get a report that shows the allocations happening at:

```
     - function main(tmax = 1000)
   128     particles = [Particle(rand(SVector{2, Float64})) for i=1:100]
     0     for n = 1:tmax
     0         N = length(particles)
192000         computeforces!(particles, N)
     -     end
     - end
```

When you increase the number of threads, you get essentially the same report:

```
JULIA_NUM_THREADS=4 julia --track-allocation=user mwe_threading_allocs.jl
```

Results in:

```
     - function main(tmax = 1000)
   128     particles = [Particle(rand(SVector{2, Float64})) for i=1:100]
     0     for n = 1:tmax
     0         N = length(particles)
191936         computeforces!(particles, N)
     -     end
     - end
```

There are no other allocations done by your code. The increase in allocations comes from the synchronization between threads, which is handled by the runtime. Although I’m kind of curious why the 4-thread run allocated 64 bytes less… might just be an edge case.
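One way to see this (a hedged sketch, not from the original post): even a `Threads.@threads` loop with an empty body allocates a little on every invocation, because the runtime spawns and synchronizes one task per thread each time the macro runs. The MWE pays that cost once per time step.

```julia
using Base.Threads

function empty_threaded(N)
    @threads for n = 1:N
        # no work and no allocations in the loop body
    end
    return nothing
end

empty_threaded(100)  # run once to compile
# Nonzero even though the body allocates nothing; the amount grows with
# nthreads() and is paid once per @threads invocation.
println(@allocated empty_threaded(100))
```

This fixed per-invocation overhead is consistent with the benchmarks above: the allocation count grows with the thread count even though the loop body itself is allocation-free.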


Yes, that seems to be what is happening in this reduced example, even though the total number and amount of allocations still grow with the number of threads.

In my real code, using `julia --track-allocation=user`, I start seeing lots of allocations in my `forces!` function that are `0` when running with a single thread.

And just not worrying about it is not really a satisfactory answer when, in the real code, GC takes up more than 20% of the execution time.
