Hi there, I am currently working on a small piece of code that is meant to demonstrate the usefulness of parallelization. To that end, I wrote a small differential equation solver based on the explicit Euler method and applied it to the Lorenz system. It looks like this:
function simple_euler(func, dt, step_number, initial_pos)
    solution = zeros((step_number, size(initial_pos, 1)))
    current_pos = initial_pos
    du = zeros(size(initial_pos, 1))
    for x in 1:step_number-1
        solution[x, :] = current_pos
        du = func(du, current_pos, dt)
        current_pos .+= du
    end
    solution[end, :] = current_pos;
    return nothing
end
function lorenz!(du, u, dt)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
    du .*= dt
end
Note that it returns nothing.
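For reference, this is the update I believe the code above performs at each step, written out as a formula. The symbols $u_n$, $\Delta t$ and $f$ are my own notation and do not appear in the code; $\Delta t$ corresponds to `dt`, and the constants 10, 28 and 8/3 are taken directly from `lorenz!`.

```latex
\begin{aligned}
u_{n+1} &= u_n + \Delta t \, f(u_n), \\
f(u) &= \begin{pmatrix}
  10\,(u_2 - u_1) \\
  u_1\,(28 - u_3) - u_2 \\
  u_1 u_2 - \tfrac{8}{3}\,u_3
\end{pmatrix}
\end{aligned}
```

With this notation, row `x` of `solution` holds $u_{x-1}$, where $u_0$ is `initial_pos`.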
Now, just for fun, I solve the differential equation, say, 10 times using
function do_stuff()
    for i in 1:10
        @time simple_euler(lorenz!,1e-4, 10^6, [1., 0., 0.]);
    end
end
@time do_stuff()
I get
0.015521 seconds (4 allocations: 22.888 MiB)
0.013177 seconds (4 allocations: 22.888 MiB)
0.012344 seconds (4 allocations: 22.888 MiB)
0.014273 seconds (4 allocations: 22.888 MiB, 14.61% gc time)
0.008484 seconds (4 allocations: 22.888 MiB)
0.008462 seconds (4 allocations: 22.888 MiB)
0.008438 seconds (4 allocations: 22.888 MiB)
0.012982 seconds (4 allocations: 22.888 MiB, 34.38% gc time)
0.008476 seconds (4 allocations: 22.888 MiB)
0.012723 seconds (4 allocations: 22.888 MiB)
0.115242 seconds (542 allocations: 228.917 MiB, 5.68% gc time)
Note that each call inside the loop makes only four allocations.
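As a quick sanity check on the reported 22.888 MiB, the size of the `solution` matrix alone seems to account for it: 10^6 rows times 3 columns times 8 bytes per `Float64`. The variable names below are only for illustration and are not part of my solver.

```julia
# Size of the solution matrix allocated by simple_euler for one run.
step_number = 10^6          # number of rows, as passed to simple_euler
n_components = 3            # length of the initial position [1., 0., 0.]

bytes = step_number * n_components * sizeof(Float64)
mib = bytes / 2^20          # convert bytes to MiB

println(round(mib, digits=3))  # prints 22.888
```

So nearly all of the memory per call is the output array itself, and the remaining allocations are tiny by comparison.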
Now I parallelized it like this
function do_stuff_parallel()
    @sync for i in 1:10
        Threads.@spawn @time simple_euler(lorenz!, 1e-4, 10^6, [1., 0., 0.]);
    end
end
@time do_stuff_parallel()
and I get the output
0.008599 seconds (68.98 k allocations: 119.026 MiB, 731.93% compilation time)
0.009811 seconds (69.58 k allocations: 141.964 MiB, 1181.88% compilation time)
0.009247 seconds (55.70 k allocations: 118.151 MiB, 1042.14% compilation time)
0.016347 seconds (111.40 k allocations: 213.414 MiB, 1014.73% compilation time)
0.014449 seconds (83.63 k allocations: 165.788 MiB, 927.87% compilation time)
0.016225 seconds (97.59 k allocations: 189.607 MiB, 929.33% compilation time)
0.012942 seconds (41.95 k allocations: 94.348 MiB, 579.63% compilation time)
0.014647 seconds (28.09 k allocations: 70.536 MiB, 356.08% compilation time)
0.017194 seconds (14.23 k allocations: 46.725 MiB, 158.74% compilation time)
0.017286 seconds (367 allocations: 22.913 MiB)
0.044721 seconds (139.91 k allocations: 238.193 MiB, 424.45% compilation time)
Note that each task now reports far more allocations. Furthermore, it seems that the code is being compiled every time it runs. However, I had called the function before, so it should already be compiled, shouldn't it? Lastly, if I go to a much higher number of iterations, say 1000, the situation does not improve significantly. I am running Julia with 16 threads on a 32-core processor. Shouldn't this run about 16 times faster, given that the runs are entirely independent of each other and make very few allocations?
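In case it helps to narrow things down, here is a minimal sketch of how the two variants could be compared as a whole instead of timing inside each task. My understanding, which is an assumption on my part and not something I have verified, is that the allocation and compilation figures printed by `@time` are not tracked per task, so numbers printed from tasks that run at the same time may overlap. The helpers `run_serial` and `run_threaded` are hypothetical names introduced only for this sketch; the solver and `lorenz!` are the same as above.

```julia
# Same solver and right-hand side as above, repeated so the sketch runs on its own.
function simple_euler(func, dt, step_number, initial_pos)
    solution = zeros((step_number, size(initial_pos, 1)))
    current_pos = initial_pos
    du = zeros(size(initial_pos, 1))
    for x in 1:step_number-1
        solution[x, :] = current_pos
        du = func(du, current_pos, dt)
        current_pos .+= du
    end
    solution[end, :] = current_pos
    return nothing
end

function lorenz!(du, u, dt)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
    du .*= dt
end

# Hypothetical helper: n independent solves, one after another.
function run_serial(n)
    for i in 1:n
        simple_euler(lorenz!, 1e-4, 10^6, [1.0, 0.0, 0.0])
    end
end

# Hypothetical helper: n independent solves, one task per solve.
function run_threaded(n)
    @sync for i in 1:n
        Threads.@spawn simple_euler(lorenz!, 1e-4, 10^6, [1.0, 0.0, 0.0])
    end
end

# Warm-up calls so that compilation is not part of the measurement.
run_serial(1)
run_threaded(1)

println("threads:  ", Threads.nthreads())
t_serial = @elapsed run_serial(10)
t_threaded = @elapsed run_threaded(10)
println("serial:   ", t_serial, " s")
println("threaded: ", t_threaded, " s")
println("speedup:  ", t_serial / t_threaded)
```

Printing `Threads.nthreads()` confirms how many threads the session was actually started with, and the ratio of the two wall-clock times is the speedup I am really asking about.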
So please, if someone could help me and explain what's wrong, I would highly appreciate it.