Independent threads much slower when parallelized

Hi there.
I will soon give a talk on the Julia programming language in my research group. In that presentation, I want to show how well parallelization works in Julia.

So I decided to take the following as an example:

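using DifferentialEquations  # provides ODEProblem, solve, Tsit5 (OrdinaryDiffEq.jl also works)

# Out-of-place Lorenz right-hand side: returns a newly allocated vector on every call.
function lorenz(u, p, t)
    dx = 10.0 * (u[2] - u[1])
    dy = u[1] * (28.0 - u[3]) - u[2]
    dz = u[1] * u[2] - (8 / 3) * u[3]
    [dx, dy, dz]
end

# Set up and solve one Lorenz problem, timing only the solve itself.
function solve_all()
    u0 = [1.0; 0.0; 0.0]
    tspan = (0.0, 15000.0)
    prob = ODEProblem(lorenz, u0, tspan)
    @time solve(prob, Tsit5())
    return nothing
end

# Run four solves sequentially and time the whole loop.
@time for i in 1:4
    solve_all()
end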

The output shows that the whole loop takes about four times as long as a single solve, as expected:

 0.773542 seconds (16.25 M allocations: 1.221 GiB, 18.02% gc time, 34.42% compilation time: 100% of which was recompilation)
  0.554355 seconds (15.69 M allocations: 1.184 GiB, 21.74% gc time)
  0.559194 seconds (15.69 M allocations: 1.184 GiB, 24.41% gc time)
  0.548238 seconds (15.69 M allocations: 1.184 GiB, 28.70% gc time)
  2.436127 seconds (63.32 M allocations: 4.773 GiB, 22.73% gc time, 10.93% compilation time: 100% of which was recompilation)

This all looks fine.

However, when I parallelize like this

@time Threads.@threads for i in 1:4
    solve_all();
end

I get the output:

  0.922861 seconds (52.87 M allocations: 3.992 GiB, 55.01% gc time, 2.96% compilation time)
  0.953759 seconds (55.08 M allocations: 4.158 GiB, 53.23% gc time, 2.87% compilation time)
  1.309005 seconds (61.15 M allocations: 4.615 GiB, 58.99% gc time, 5.76% compilation time)
  1.343746 seconds (62.75 M allocations: 4.735 GiB, 57.46% gc time)
  1.369621 seconds (62.79 M allocations: 4.738 GiB, 56.38% gc time, 9.29% compilation time)

In short: each individual solve of the differential equation takes almost twice as long as in the sequential loop, and the reported allocations and GC time are much higher as well.

Why is this? And how can I fix it?

Thank you very much in advance!

Code that heavily allocates will scale suboptimally when you apply threading. Fortunately there is a big and juicy low-hanging fruit to pick here: Just use an in-place version of your problem function, like so:
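
A minimal sketch of what such an in-place version could look like, assuming the same Lorenz parameters as in the question. The name `lorenz!` and the scratch vector `du` are illustrative choices, not taken from the original post:

```julia
# In-place Lorenz right-hand side: writes the derivative into `du`
# instead of returning a freshly allocated vector.
function lorenz!(du, u, p, t)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
    return nothing
end

# Quick check without any solver: evaluate the right-hand side once.
u0 = [1.0, 0.0, 0.0]
du = similar(u0)              # preallocated output buffer
lorenz!(du, u0, nothing, 0.0) # fills `du`, allocates nothing new
```

In `solve_all`, passing `lorenz!` to `ODEProblem` in place of `lorenz` should be the only change needed, since DifferentialEquations.jl treats a function with the four-argument signature `f(du, u, p, t)` as in-place.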

That should reduce allocations considerably, since you no longer allocate a new vector each time this function is called. It should improve the scaling with threading as well.

Oh sorry! I thought this was the solution, but it turns out not to be. With this improvement the solver becomes a lot faster. However, the difference between the parallelized version and the sequential version becomes even more extreme. For the sequential version of the loop I now get:

@time for i=1:16
    @time solve_all()
end

  0.000575 seconds (11.52 k allocations: 1000.578 KiB)
  0.000514 seconds (11.52 k allocations: 1000.578 KiB)
  0.000508 seconds (11.52 k allocations: 1000.578 KiB)
  0.000517 seconds (11.52 k allocations: 1000.578 KiB)
  0.000515 seconds (11.52 k allocations: 1000.578 KiB)
  0.000520 seconds (11.52 k allocations: 1000.578 KiB)
  0.000515 seconds (11.52 k allocations: 1000.578 KiB)
  0.000514 seconds (11.52 k allocations: 1000.578 KiB)
  0.000518 seconds (11.52 k allocations: 1000.578 KiB)
  0.000513 seconds (11.52 k allocations: 1000.578 KiB)
  0.000528 seconds (11.52 k allocations: 1000.578 KiB)
  0.000514 seconds (11.52 k allocations: 1000.578 KiB)
  0.000517 seconds (11.52 k allocations: 1000.578 KiB)
  0.000524 seconds (11.52 k allocations: 1000.578 KiB)
  0.000544 seconds (11.52 k allocations: 1000.578 KiB)
  0.000533 seconds (11.52 k allocations: 1000.578 KiB)
  0.008547 seconds (184.93 k allocations: 15.674 MiB)

and for the threaded version I get:

@time Threads.@threads for i=1:16
    @time solve_all()
end

  0.001360 seconds (112.83 k allocations: 9.521 MiB, 14518.17% compilation time)
  0.001597 seconds (130.93 k allocations: 11.396 MiB, 5793.81% compilation time)
  0.001935 seconds (170.02 k allocations: 14.494 MiB, 6824.19% compilation time)
  0.001987 seconds (172.85 k allocations: 14.710 MiB, 8621.50% compilation time)
  0.001974 seconds (170.86 k allocations: 14.550 MiB, 9341.98% compilation time)
  0.002000 seconds (175.23 k allocations: 14.901 MiB, 7912.43% compilation time)
  0.002030 seconds (180.34 k allocations: 15.321 MiB, 5849.08% compilation time)
  0.002024 seconds (180.67 k allocations: 15.361 MiB, 3273.54% compilation time)
  0.002036 seconds (181.42 k allocations: 15.417 MiB, 3899.66% compilation time)
  0.002029 seconds (180.70 k allocations: 15.356 MiB, 5207.72% compilation time)
  0.002053 seconds (181.09 k allocations: 15.369 MiB, 7070.73% compilation time)
  0.002006 seconds (176.95 k allocations: 15.066 MiB)
  0.001964 seconds (175.58 k allocations: 14.951 MiB, 682.24% compilation time)
  0.002050 seconds (180.70 k allocations: 15.366 MiB, 1938.80% compilation time)
  0.002026 seconds (180.56 k allocations: 15.355 MiB, 2621.54% compilation time)
  0.002053 seconds (179.96 k allocations: 15.308 MiB, 1294.04% compilation time)
  0.029549 seconds (216.06 k allocations: 17.826 MiB, 758.20% compilation time)

So it seems that the majority of the time is spent on compilation. Any idea how to tackle this?