When trying to parallelize a loop, the performance got absolutely thrashed, roughly by a factor of 1000. I tried to replicate the problem with an MWE and found something interesting:
```julia
main() = loop(Vector{Float64}(undef, 100_000))

function loop(arr)
    for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float)*cos(k_float)*tan(k_float)/sqrt(k_float) # Just some calculation to sink time into
    end
end

mainthreaded() = loopthreaded(Vector{Float64}(undef, 100_000))

function loopthreaded(arr)
    Threads.@threads for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float)*cos(k_float)*tan(k_float)/sqrt(k_float) # Just some calculation to sink time into
    end
end

println("===============")
@time main()
@time mainthreaded()
```
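(Julia is started with multiple threads here, e.g. via `julia --threads 4` or the `JULIA_NUM_THREADS` environment variable; a quick sanity check along these lines confirms the thread count:)

```julia
# Sanity check: the threaded loop only has a chance to help if Julia was started
# with more than one thread (e.g. `julia --threads 4 script.jl`).
println("Number of threads: ", Threads.nthreads())
```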
When running this script four times in a row, I get the following REPL output:
So the threaded loop is roughly 5 times slower than the sequential loop (not counting GC time) after the first execution. I am aware that for such a trivial example the cost of managing threads is much higher than the loop body itself, which explains the loss in speed; in my RL application the calculation is expensive enough for multithreading to make sense. However, two things really stood out to me:
- The high number of allocations (from my understanding, due to the thread spawning?)
- Every time the mainthreaded() function is invoked, most of the time is reported as compilation time?
In my real-world application, the Gtk package is used (not in the loop, just in general). Profiling reports that almost all time is spent in gtk_main(), which seems to be the same issue as reported here:
However, even when completely removing Gtk from the project, the issue simply shifts to threading-setup functions.
One suggestion I found was using the ThreadPools package, see here:
However, this didn’t help either.
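(For reference, the ThreadPools variant looked roughly like the sketch below; I'm assuming the @qthreads macro from that package here, the exact invocation may have differed:)

```julia
# Sketch of the ThreadPools-based variant (assumes the ThreadPools package and its
# @qthreads macro, which schedules iterations onto queued worker tasks).
using ThreadPools

function loopqthreaded(arr)
    @qthreads for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float)*cos(k_float)*tan(k_float)/sqrt(k_float)
    end
end
```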
Therefore, I am wondering whether an individual thread is spawned for each value of k and the loop body is recompiled for every single iteration. Could this be true? This could also explain the multitude of threading-related issues here on Discourse, e.g.:
When you say “running the script 4 times in a row”, do you mean you are restarting Julia in between runs? That would explain why you see these compilation times, as the compiled code isn’t cached.
When I run the code twice in the same session, I get the following:
Thank you for your fast answer! Well, this is embarrassing… I ran the entire script four times, but of course this redefined the functions, so a recompilation was necessary each time. Doing it properly (see the sketch below) solves the issue. It doesn’t solve my RL-application issue, but I think I have to come up with a new MWE for that.
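(By “doing it properly” I mean something along these lines, run in a single session without re-including the script:)

```julia
# Define the functions once (include the script a single time), then time the
# already-compiled functions; re-including the script redefines them and forces
# recompilation on every run.
main();          # first call pays the compilation cost
mainthreaded();  # first call pays the compilation cost

@time main()          # now measures only the loop
@time mainthreaded()  # now measures the loop plus threading overhead
```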
The real issue is that the function without multithreading is precompiled when you run the script, whereas the threaded function is not. That is why you don’t see the compilation time of main() every time you re-run the script (even though, as you said, you are redefining the function). I tried adding precompile statements, to no avail.
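(The precompile statements were along the lines of the sketch below; note that the argument types need to match the concrete types actually used at the call sites, and I'm not sure this is the right way to handle the closure generated by Threads.@threads:)

```julia
# Sketch of the precompile statements tried (argument types must match the
# concrete types used at the call sites for precompilation to be useful).
precompile(loop, (Vector{Float64},))
precompile(loopthreaded, (Vector{Float64},))
precompile(main, ())
precompile(mainthreaded, ())
```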
Someone more knowledgeable than me should pitch in as to why this is the case.