Independent threads much slower when parallelized

jabru · January 25, 2024, 12:07am

Hi there.
So soon I will give a talk on the Julia programming language in my research group. In that presentation, I wanted to show how well parallelization works in Julia.

So I decided to take the following as an example:

using DifferentialEquations  # assumed; OrdinaryDiffEq also provides ODEProblem, solve and Tsit5

function lorenz(u, p, t)
    dx = 10.0 * (u[2] - u[1])
    dy = u[1] * (28.0 - u[3]) - u[2]
    dz = u[1] * u[2] - (8 / 3) * u[3]
    [dx, dy, dz]
end


function solve_all()
    u0 = [1.0; 0.0; 0.0]
    tspan = (0.0, 15000.0);
    prob = ODEProblem(lorenz, u0, tspan);
    @time solve(prob, Tsit5());
    return nothing
end

@time for i in 1:4 
    solve_all();
end

As expected, the total time is about four times that of a single solve:

  0.773542 seconds (16.25 M allocations: 1.221 GiB, 18.02% gc time, 34.42% compilation time: 100% of which was recompilation)
  0.554355 seconds (15.69 M allocations: 1.184 GiB, 21.74% gc time)
  0.559194 seconds (15.69 M allocations: 1.184 GiB, 24.41% gc time)
  0.548238 seconds (15.69 M allocations: 1.184 GiB, 28.70% gc time)
  2.436127 seconds (63.32 M allocations: 4.773 GiB, 22.73% gc time, 10.93% compilation time: 100% of which was recompilation)

Now this seems all nice.

However, when I parallelize like this

@time Threads.@threads for i in 1:4
    solve_all();
end

I get the output:

  0.922861 seconds (52.87 M allocations: 3.992 GiB, 55.01% gc time, 2.96% compilation time)
  0.953759 seconds (55.08 M allocations: 4.158 GiB, 53.23% gc time, 2.87% compilation time)
  1.309005 seconds (61.15 M allocations: 4.615 GiB, 58.99% gc time, 5.76% compilation time)
  1.343746 seconds (62.75 M allocations: 4.735 GiB, 57.46% gc time)
  1.369621 seconds (62.79 M allocations: 4.738 GiB, 56.38% gc time, 9.29% compilation time)

In short: each individual solve of the differential equation takes roughly twice as long as in the sequential loop, and the reported allocations are far higher as well.

Why is this? And how can I fix it?

Thank you very much in advance!

abraemer · January 25, 2024, 6:34am

Code that allocates heavily tends to scale poorly under threading, and the GC share of over 50% in your threaded timings points that way. Fortunately, there is a big, juicy low-hanging fruit to pick here: just use an in-place version of your problem function, like so:
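Roughly something like the following. This is a minimal sketch, assuming the usual four-argument in-place form `f!(du, u, p, t)`; the name `lorenz!` is only illustrative, and the equations are copied from the `lorenz` function in the question.

```julia
# In-place right-hand side: writes the derivative into the preallocated
# buffer `du` instead of returning a freshly allocated vector.
# `lorenz!` is an illustrative name; `p` and `t` are unused, as in the original.
function lorenz!(du, u, p, t)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
    return nothing
end

# Quick check without any packages: the caller owns and reuses the buffer.
u0 = [1.0, 0.0, 0.0]
du = similar(u0)
lorenz!(du, u0, nothing, 0.0)
```

Passing it as `ODEProblem(lorenz!, u0, tspan)` in place of `lorenz` should be enough for the solver to treat the problem as in-place, since it infers that from the four-argument signature.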

That should cut allocations considerably, since you no longer allocate a new vector on every call to this function, and it should improve the scaling with threading as well.

jabru · January 25, 2024, 10:26am

Oh sorry! I thought this was the solution, but it turns out not to be. With the improvement, the solver becomes a lot faster! However, the difference between the parallelized version and the sequential version becomes more extreme. For the sequential version of the loop I now get:

@time for i=1:16
    @time solve_all()
end

  0.000575 seconds (11.52 k allocations: 1000.578 KiB)
  0.000514 seconds (11.52 k allocations: 1000.578 KiB)
  0.000508 seconds (11.52 k allocations: 1000.578 KiB)
  0.000517 seconds (11.52 k allocations: 1000.578 KiB)
  0.000515 seconds (11.52 k allocations: 1000.578 KiB)
  0.000520 seconds (11.52 k allocations: 1000.578 KiB)
  0.000515 seconds (11.52 k allocations: 1000.578 KiB)
  0.000514 seconds (11.52 k allocations: 1000.578 KiB)
  0.000518 seconds (11.52 k allocations: 1000.578 KiB)
  0.000513 seconds (11.52 k allocations: 1000.578 KiB)
  0.000528 seconds (11.52 k allocations: 1000.578 KiB)
  0.000514 seconds (11.52 k allocations: 1000.578 KiB)
  0.000517 seconds (11.52 k allocations: 1000.578 KiB)
  0.000524 seconds (11.52 k allocations: 1000.578 KiB)
  0.000544 seconds (11.52 k allocations: 1000.578 KiB)
  0.000533 seconds (11.52 k allocations: 1000.578 KiB)
  0.008547 seconds (184.93 k allocations: 15.674 MiB)

and for the threaded version I get:

@time Threads.@threads for i=1:16
    @time solve_all()
end

  0.001360 seconds (112.83 k allocations: 9.521 MiB, 14518.17% compilation time)
  0.001597 seconds (130.93 k allocations: 11.396 MiB, 5793.81% compilation time)
  0.001935 seconds (170.02 k allocations: 14.494 MiB, 6824.19% compilation time)
  0.001987 seconds (172.85 k allocations: 14.710 MiB, 8621.50% compilation time)
  0.001974 seconds (170.86 k allocations: 14.550 MiB, 9341.98% compilation time)
  0.002000 seconds (175.23 k allocations: 14.901 MiB, 7912.43% compilation time)
  0.002030 seconds (180.34 k allocations: 15.321 MiB, 5849.08% compilation time)
  0.002024 seconds (180.67 k allocations: 15.361 MiB, 3273.54% compilation time)
  0.002036 seconds (181.42 k allocations: 15.417 MiB, 3899.66% compilation time)
  0.002029 seconds (180.70 k allocations: 15.356 MiB, 5207.72% compilation time)
  0.002053 seconds (181.09 k allocations: 15.369 MiB, 7070.73% compilation time)
  0.002006 seconds (176.95 k allocations: 15.066 MiB)
  0.001964 seconds (175.58 k allocations: 14.951 MiB, 682.24% compilation time)
  0.002050 seconds (180.70 k allocations: 15.366 MiB, 1938.80% compilation time)
  0.002026 seconds (180.56 k allocations: 15.355 MiB, 2621.54% compilation time)
  0.002053 seconds (179.96 k allocations: 15.308 MiB, 1294.04% compilation time)
  0.029549 seconds (216.06 k allocations: 17.826 MiB, 758.20% compilation time)

So it seems as though the majority of the time is spent on compilation. Any idea how to tackle this?
