Parallelization seems to increase the number of allocations for single threads

Hi there, I am currently working on a small piece of code that is meant to demonstrate the usefulness of parallelization. For that I wrote a small solver based on the explicit Euler method (u_{n+1} = u_n + dt * f(u_n)), which I use here to integrate the Lorenz system. It looks like this:

function simple_euler(func, dt, step_number, initial_pos)
    solution = zeros((step_number, size(initial_pos, 1)))  # one row per time step
    current_pos = initial_pos
    du = zeros(size(initial_pos, 1))

    for x in 1:step_number-1
        solution[x, :] = current_pos
        du = func(du, current_pos, dt)
        current_pos .+= du
    end

    solution[end, :] = current_pos

    return nothing
end

function lorenz!(du, u, dt)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
    du .*= dt
end

Note that it returns nothing.

Now, just for fun, I solve the differential equation, say, 10 times:

function do_stuff()
    for i in 1:10
      @time simple_euler(lorenz!, 1e-4, 10^6, [1., 0., 0.]);
    end
end

@time do_stuff()

I get:

  0.015521 seconds (4 allocations: 22.888 MiB)
  0.013177 seconds (4 allocations: 22.888 MiB)
  0.012344 seconds (4 allocations: 22.888 MiB)
  0.014273 seconds (4 allocations: 22.888 MiB, 14.61% gc time)
  0.008484 seconds (4 allocations: 22.888 MiB)
  0.008462 seconds (4 allocations: 22.888 MiB)
  0.008438 seconds (4 allocations: 22.888 MiB)
  0.012982 seconds (4 allocations: 22.888 MiB, 34.38% gc time)
  0.008476 seconds (4 allocations: 22.888 MiB)
  0.012723 seconds (4 allocations: 22.888 MiB)
  0.115242 seconds (542 allocations: 228.917 MiB, 5.68% gc time)

Note that there are only four allocations each time the loop body runs.
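
A quick sanity check (this arithmetic is mine, not part of the output): those 22.888 MiB per call are essentially just the solution matrix:

10^6 * 3 * sizeof(Float64) / 2^20   # step_number rows × 3 Float64 columns ≈ 22.888 MiB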

Now I parallelized it like this:

function do_stuff_parallel()
    @sync for i in 1:10
         Threads.@spawn @time simple_euler(lorenz!, 1e-4, 10^6, [1., 0., 0.]);
    end
end

@time do_stuff_parallel()

and I get the output

  0.008599 seconds (68.98 k allocations: 119.026 MiB, 731.93% compilation time)
  0.009811 seconds (69.58 k allocations: 141.964 MiB, 1181.88% compilation time)
  0.009247 seconds (55.70 k allocations: 118.151 MiB, 1042.14% compilation time)
  0.016347 seconds (111.40 k allocations: 213.414 MiB, 1014.73% compilation time)
  0.014449 seconds (83.63 k allocations: 165.788 MiB, 927.87% compilation time)
  0.016225 seconds (97.59 k allocations: 189.607 MiB, 929.33% compilation time)
  0.012942 seconds (41.95 k allocations: 94.348 MiB, 579.63% compilation time)
  0.014647 seconds (28.09 k allocations: 70.536 MiB, 356.08% compilation time)
  0.017194 seconds (14.23 k allocations: 46.725 MiB, 158.74% compilation time)
  0.017286 seconds (367 allocations: 22.913 MiB)
  0.044721 seconds (139.91 k allocations: 238.193 MiB, 424.45% compilation time)

Note that now, each time the code in the loop is executed, it makes a lot more allocations. Furthermore, it seems as though it needs to compile the code every time it runs. However, I called the function before, so it should already be compiled, shouldn't it? Lastly, if I go to a much higher number of iterations, say 1000, the situation does not improve significantly. I am running Julia with 16 threads on a 32-core processor. Should that not run about 16 times faster, since the iterations are entirely independent of each other and make very few allocations?

So please, if someone could help me and explain what's wrong, I would highly appreciate it.

Remove @time from within your function; it may allocate itself and mess things up. Use @btime instead.

julia> using BenchmarkTools

julia> @btime do_stuff()
  107.465 ms (40 allocations: 228.88 MiB)

julia> @btime do_stuff_parallel()
  21.387 ms (90 allocations: 228.89 MiB)

First of all: Thanks for the quick reply!

This does explain why there are so many more allocations; however, it does not seem to be the full answer. When I removed the @time calls in the loops and did what you suggested, I got for 100 iterations:

sequential:

1.000 s (400 allocations: 2.24 GiB)

and for the parallel version:

175.350 ms (828 allocations: 2.24 GiB)

This is a bit more than 5 times faster. However, given that the iterations are completely independent, plus the fact that there are so few allocations and I am running on 16 cores, wouldn't one expect this to improve even more?

I really cannot see any reason why this would not become much faster for a large number of iterations.

Thank you though for clearing this up!

If you look at the implementation of @time, you will see that it is not made for multi-threaded use – results will be kinda strange if you start @time in different threads, because the allocation and compile-time counters are shared between threads.
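
If you want per-task timings anyway, a minimal sketch (the function name is just for illustration) is to measure locally with time_ns() inside each task:

function do_stuff_parallel_timed(; N = 10)
    @sync for i in 1:N
        Threads.@spawn begin
            t0 = time_ns()   # local wall-clock measurement, no shared counters
            simple_euler(lorenz!, 1e-4, 10^6, [1., 0., 0.])
            println("task $i: ", (time_ns() - t0) / 1e9, " s")
        end
    end
end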

Nevertheless, the result still looks strange to me.

You are allocating a lot of memory across the iterations (2.24 GiB in total), and that has an overhead, because physical memory is a shared resource – there is only one (or two, or three…) memory channel serving all cores.

You can improve on that by allocating the memory fewer times (once per thread). For example:

nthreads = 10
nchunks = 10
  0.789969 seconds (103 allocations: 22.896 MiB)
  0.266457 seconds (811 allocations: 2.235 GiB, 19.68% gc time)
  0.168009 seconds (190 allocations: 228.896 MiB)

The first run is the serial one (to be fair, I've optimized the serial code: it now allocates the solution only once).

The second run allocates a new solution at every iteration.

The third run splits the workload into nthreads() chunks and allocates one solution per chunk.

Still, the scaling is far from ideal (5x for 10 threads), but it has improved already. You probably need something where the computation time is large relative to the memory-management time to improve on that. But since your output size grows with step_number, that's hard with this example.

Here is the code I’ve run:

Code
using Base.Threads
using ChunkSplitters

function simple_euler(
    func, dt, step_number, initial_pos; 
    solution = zeros((step_number, size(initial_pos, 1)))
)
    current_pos = initial_pos
    du = zeros(size(initial_pos, 1))
    for x in 1:step_number-1
        solution[x, :] = current_pos
        du = func(du, current_pos, dt)
        current_pos .+= du
    end
    solution[end, :] = current_pos
    return nothing
end

function lorenz!(du, u, dt)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
    du .*= dt
end

function do_stuff(;N=100, step_number=10^6)
    initial_pos = [ 1., 0., 0. ]
    solution = zeros(step_number, length(initial_pos))
    for i in 1:N
        simple_euler(
            lorenz!, 1e-4, step_number, initial_pos;
            solution = solution
        )
    end
end

function do_stuff_parallel(;N=100, step_number=10^6, nchunks=nthreads()) # nchunks unused here; kept for a uniform signature
    @sync for _ in 1:N
        @spawn begin
            initial_pos = [ 1., 0., 0. ]
            solution = zeros(step_number, length(initial_pos))
            simple_euler(
                lorenz!, 1e-4, step_number, initial_pos;
                solution = solution
            )
        end
    end
end

function do_stuff_parallel2(;N=100, step_number=10^6, nchunks=nthreads())
    @sync for inds in chunks(1:N; n=nchunks)
        @spawn begin
            initial_pos = [ 1., 0., 0. ]
            solution = zeros(step_number, length(initial_pos))
            for _ in inds
                simple_euler(
                    lorenz!, 1e-4, step_number, initial_pos;
                    solution = solution
                )
            end
        end
    end
end

function run(;N=100, step_number=10^6, nchunks=nthreads())
    println("nthreads = ", nthreads())
    println("nchunks = ", nchunks)
    # Compile
    do_stuff(;N=10, step_number=step_number)
    do_stuff_parallel(;N=10, step_number=step_number)
    do_stuff_parallel2(;N=10, step_number=step_number)
    # Run
    @time do_stuff(;N=N, step_number=step_number)
    @time do_stuff_parallel(;N=N, step_number=step_number, nchunks=nchunks)
    @time do_stuff_parallel2(;N=N, step_number=step_number, nchunks=nchunks)
end

ps: note that the time was almost cut in half in the serial version just by allocating the solution only once. So your code's performance was quite memory-bound, which makes scaling hard.

Thank you, that seems to be the answer! It turns out that, for what I want to do, storing the solution is not strictly necessary. When I remove it, the parallelized version is about 15 times faster.

Be careful to return something from the functions, otherwise the compiler can trick you by optimizing the whole computation away.

:+1:

What’s happening is almost surely that we’re saturating the available memory bandwidth on the system. That is a shared-between-cores resource.

So the scaling curve (in number of threads) should first be perfectly linear and then flat, with a sharp kink. Possibly some of the contention is in the shared L3 cache instead – that would give a smoother kink.

This is not really about memory management – perfectly non-allocating code would have the same issue.
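
To put rough numbers on it (a back-of-the-envelope using figures from earlier in the thread; the real traffic is even higher, since zeros() also zero-fills each buffer and the GC moves memory around):

total_bytes = 100 * 10^6 * 3 * sizeof(Float64)  # N * step_number * 3 coordinates
total_bytes / 2^30                              # ≈ 2.235 GiB, matching the @btime report
(total_bytes / 2^30) / 0.175                    # ≈ 12.8 GiB/s of sustained writes at 175 ms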

Try not storing the trajectory, only initial and final points. Then you should have your perfect scaling.
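
A minimal sketch of that (the name simple_euler_final is just for illustration):

function simple_euler_final(func, dt, step_number, initial_pos)
    current_pos = copy(initial_pos)  # don't mutate the caller's vector
    du = similar(current_pos)
    for _ in 1:step_number-1
        func(du, current_pos, dt)
        current_pos .+= du
    end
    return current_pos  # return the result so the compiler cannot elide the work
end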

Thank you for the nice support!