Parallelization seems to increase the necessary amount of allocations for single threads

Hi there, I am currently working on a small piece of code that is meant to demonstrate the usefulness of parallelization. To that end I wrote a small Euler-type differential equation solver, which I use on the Lorenz system. It looks like this:

```julia
function simple_euler(func, dt, step_number, initial_pos)
    solution = zeros((step_number, size(initial_pos, 1)))
    current_pos = initial_pos
    du = zeros(size(initial_pos, 1))

    for x in 1:step_number-1
        solution[x, :] = current_pos
        du = func(du, current_pos, dt)
        current_pos .+= du
    end

    solution[end, :] = current_pos

    return nothing
end

function lorenz!(du, u, dt)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
    du .*= dt
end
```

Note that it returns nothing.

Now, just for fun, I solve the differential equation, say, 10 times:

```julia
function do_stuff()
    for i in 1:10
        @time simple_euler(lorenz!, 1e-4, 10^6, [1.0, 0.0, 0.0])
    end
end

@time do_stuff()
```

I get

```
0.015521 seconds (4 allocations: 22.888 MiB)
0.013177 seconds (4 allocations: 22.888 MiB)
0.012344 seconds (4 allocations: 22.888 MiB)
0.014273 seconds (4 allocations: 22.888 MiB, 14.61% gc time)
0.008484 seconds (4 allocations: 22.888 MiB)
0.008462 seconds (4 allocations: 22.888 MiB)
0.008438 seconds (4 allocations: 22.888 MiB)
0.012982 seconds (4 allocations: 22.888 MiB, 34.38% gc time)
0.008476 seconds (4 allocations: 22.888 MiB)
0.012723 seconds (4 allocations: 22.888 MiB)
0.115242 seconds (542 allocations: 228.917 MiB, 5.68% gc time)
```

Note that there are only four allocations each time the loop runs (the last line is the outer `@time` for the whole call).

Now I parallelized it like this:

```julia
function do_stuff_parallel()
    @sync for i in 1:10
        Threads.@spawn @time simple_euler(lorenz!, 1e-4, 10^6, [1.0, 0.0, 0.0])
    end
end

@time do_stuff_parallel()
```

and I get the output

```
0.008599 seconds (68.98 k allocations: 119.026 MiB, 731.93% compilation time)
0.009811 seconds (69.58 k allocations: 141.964 MiB, 1181.88% compilation time)
0.009247 seconds (55.70 k allocations: 118.151 MiB, 1042.14% compilation time)
0.016347 seconds (111.40 k allocations: 213.414 MiB, 1014.73% compilation time)
0.014449 seconds (83.63 k allocations: 165.788 MiB, 927.87% compilation time)
0.016225 seconds (97.59 k allocations: 189.607 MiB, 929.33% compilation time)
0.012942 seconds (41.95 k allocations: 94.348 MiB, 579.63% compilation time)
0.014647 seconds (28.09 k allocations: 70.536 MiB, 356.08% compilation time)
0.017194 seconds (14.23 k allocations: 46.725 MiB, 158.74% compilation time)
0.017286 seconds (367 allocations: 22.913 MiB)
0.044721 seconds (139.91 k allocations: 238.193 MiB, 424.45% compilation time)
```

Note that now each time the code in the loop is executed, it makes many more allocations. Furthermore, it seems as though it needs to compile the code every time it runs. However, I called the function before, so it should already be compiled, shouldn't it? Lastly, if I go to a much higher number of iterations, say 1000, the situation does not improve significantly. I am running Julia with 16 threads on a 32-core processor. Shouldn't that run about 16 times faster, since the iterations are entirely independent of each other and make very few allocations?

So please, if someone could help me and explain what's wrong, I would highly appreciate it.

Remove `@time` from within your function; it may allocate itself and mess things up. Use `@btime` instead:

```julia
julia> using BenchmarkTools

julia> @btime do_stuff()
  107.465 ms (40 allocations: 228.88 MiB)

julia> @btime do_stuff_parallel()
  21.387 ms (90 allocations: 228.89 MiB)
```

First of all: Thanks for the quick reply!

This does explain why there are so many more allocations; however, it does not seem to be the full answer. When I removed the `@time`s in the loops and did what you suggested, I got, for 100 iterations,

sequential:

```
1.000 s (400 allocations: 2.24 GiB)
```

and for the parallel version:

```
175.350 ms (828 allocations: 2.24 GiB)
```

This is a bit more than 5 times faster; however, given that the code is completely independent, that there are so few allocations, and that I am running on 16 cores, wouldn't one expect an even bigger improvement?

I really cannot see any reason why this would not become much faster for a large number of iterations.

Thank you though for clearing this up!

If you look at the implementation of `@time`, you will see that it is not made for multi-threaded use – results will be kinda strange if you run `@time` in different threads, because the allocation and compile-time counters are shared between threads.
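One way around that, as a sketch (the helper name `timed_tasks` is made up for illustration), is to time each task with `@elapsed`, which just returns a wall-time `Float64` and does not touch the shared counters:

```julia
using Base.Threads

# Time each task with @elapsed, which returns a plain Float64 wall time
# and does not use the shared allocation/compile-time counters that
# make per-thread @time output confusing.
function timed_tasks(work, n)
    times = Vector{Float64}(undef, n)
    @sync for i in 1:n
        # Each task writes to its own slot, so there is no data race.
        Threads.@spawn times[i] = @elapsed work()
    end
    return times
end

times = timed_tasks(() -> sum(abs2, rand(10^5)), 4)
```

This gives you one clean wall-clock number per task, though it tells you nothing about allocations.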

Nevertheless, the result still looks strange to me.

You are allocating a lot of memory in total across the iterations (2.24 GiB), and that has an overhead, because physically there is only one memory system (or two, or three channels…) shared by all cores.

You can improve on that by allocating the memory fewer times (once per thread). For example:

```
nthreads = 10
nchunks = 10
0.789969 seconds (103 allocations: 22.896 MiB)
0.266457 seconds (811 allocations: 2.235 GiB, 19.68% gc time)
0.168009 seconds (190 allocations: 228.896 MiB)
```

The first run is the serial one (to be fair, I've optimized the serial code as well: I now allocate `solution` only once).

The second run is allocating a new `solution` at every iteration.

The third run is by splitting the workload into `nthreads()` chunks, and allocating one `solution` per chunk.

Still, the scaling is far from ideal (5x for 10 threads), but it has already improved. To improve on that, you probably need a problem where the computation time is larger relative to the memory-management time. But since your output has size proportional to `step_number`, that is hard with this example.

Here is the code I've run:

```julia
using Base.Threads
using ChunkSplitters

function simple_euler(
    func, dt, step_number, initial_pos;
    solution = zeros((step_number, size(initial_pos, 1)))
)
    current_pos = initial_pos
    du = zeros(size(initial_pos, 1))
    for x in 1:step_number-1
        solution[x, :] = current_pos
        du = func(du, current_pos, dt)
        current_pos .+= du
    end
    solution[end, :] = current_pos
    return nothing
end

function lorenz!(du, u, dt)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
    du .*= dt
end

# Serial version: allocate `solution` once and reuse it.
function do_stuff(; N=100, step_number=10^6)
    initial_pos = [1.0, 0.0, 0.0]
    solution = zeros(step_number, length(initial_pos))
    for i in 1:N
        simple_euler(lorenz!, 1e-4, step_number, initial_pos; solution=solution)
    end
end

# Parallel version: one task per solve, a new `solution` per iteration.
function do_stuff_parallel(; N=100, step_number=10^6)
    @sync for _ in 1:N
        @spawn begin
            initial_pos = [1.0, 0.0, 0.0]
            solution = zeros(step_number, length(initial_pos))
            simple_euler(lorenz!, 1e-4, step_number, initial_pos; solution=solution)
        end
    end
end

# Chunked parallel version: split the work into chunks,
# allocating one `solution` per chunk.
function do_stuff_parallel2(; N=100, step_number=10^6, nchunks=nthreads())
    @sync for inds in chunks(1:N; n=nchunks)
        @spawn begin
            initial_pos = [1.0, 0.0, 0.0]
            solution = zeros(step_number, length(initial_pos))
            for _ in inds
                simple_euler(lorenz!, 1e-4, step_number, initial_pos; solution=solution)
            end
        end
    end
end

let N = 100, step_number = 10^6, nchunks = nthreads()
    println("nthreads = ", nthreads())
    println("nchunks = ", nchunks)
    # Compile
    do_stuff(; N=10, step_number=step_number)
    do_stuff_parallel(; N=10, step_number=step_number)
    do_stuff_parallel2(; N=10, step_number=step_number)
    # Run
    @time do_stuff(; N=N, step_number=step_number)
    @time do_stuff_parallel(; N=N, step_number=step_number)
    @time do_stuff_parallel2(; N=N, step_number=step_number, nchunks=nchunks)
end
```

ps: note that the time was almost cut in half in the serial version just by allocating `solution` only once. So your code's performance was quite memory-bound, which makes scaling hard.

Thank you, that seems to be the answer! It turns out that for what I want to do, storing the full solution is not strictly necessary. When I remove it, the parallelized version is about 15 times faster.

Be careful to return something from the functions, though; otherwise the compiler may optimize the computation away entirely and trick you.
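A minimal sketch of the pitfall (the function names are made up for illustration, using the scalar test equation u' = -u): if the result is neither returned nor used, the compiler is free to eliminate the whole loop, and the benchmark measures nothing:

```julia
# Returns nothing and the result is never used: the loop body is dead
# code and may be removed entirely, making any timing meaningless.
function euler_discard(dt, n)
    u = 1.0
    for _ in 1:n
        u += dt * (-u)   # explicit Euler step for u' = -u
    end
    return nothing
end

# Returning the final state forces the computation to actually happen.
function euler_keep(dt, n)
    u = 1.0
    for _ in 1:n
        u += dt * (-u)
    end
    return u
end

u_final = euler_keep(1e-3, 1000)
```

Each step multiplies `u` by `(1 - dt)`, so the returned value should be close to `(1 - dt)^n`, which is an easy sanity check that the work was really done.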

What’s happening is almost surely that we’re saturating the available memory bandwidth on the system. That is a shared-between-cores resource.

So the scaling curve (in number of threads) should first be perfectly linear, then flat, with a sharp kink. Possibly some of the contention is on the shared L3 cache instead – that would give a smoother kink.

This is not really about memory management – perfectly non-allocating code would have the same issue.

Try not storing the trajectory, only initial and final points. Then you should have your perfect scaling.
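As a sketch of that suggestion (reusing the `lorenz!` from above; the name `euler_final` is made up): keep only the current state and return the final point. Each solve then allocates just two small vectors, so the threads no longer compete for memory bandwidth writing out trajectories.

```julia
function lorenz!(du, u, dt)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
    du .*= dt
end

# Trajectory-free Euler: only the current state is kept in memory,
# and the final point is returned so the work cannot be optimized away.
function euler_final(func, dt, step_number, initial_pos)
    u = copy(initial_pos)    # don't mutate the caller's initial condition
    du = similar(u)
    for _ in 1:step_number-1
        func(du, u, dt)
        u .+= du
    end
    return u
end

final = euler_final(lorenz!, 1e-4, 10^6, [1.0, 0.0, 0.0])
```

Spawning many such solves should scale much closer to the core count, since each task touches only a few hundred bytes of memory.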

Thank you for the nice support!