Threaded loop far slower than sequential loop (+ high compilation time)

Hello,

when trying to parallelize a loop, the performance got absolutely thrashed - roundabout by a factor of 1000. I tried to replicate the problem with a MWE and found something interesting:

main() = loop(Vector{Float64}(undef, 100_000))
function loop(arr)
    for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float)*cos(k_float)*tan(k_float)/sqrt(k_float) # Just some calculation to sink time into
    end
end

mainthreaded() = loopthreaded(Vector{Float64}(undef, 100_000))
function loopthreaded(arr)
    Threads.@threads for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float)*cos(k_float)*tan(k_float)/sqrt(k_float) # Just some calculation to sink time into
    end
end

println("===============")
@time main()
@time mainthreaded()

When running this script four times in a row, I get the following REPL output:

  0.004578 seconds (2 allocations: 781.328 KiB)
  0.041883 seconds (47.42 k allocations: 3.599 MiB, 95.31% compilation time)
===============
  0.004192 seconds (2 allocations: 781.328 KiB)
  0.022073 seconds (18.21 k allocations: 1.812 MiB, 91.24% compilation time)
===============
  0.004347 seconds (2 allocations: 781.328 KiB)
  0.023520 seconds (18.21 k allocations: 1.812 MiB, 91.38% compilation time)
===============
  0.005623 seconds (2 allocations: 781.328 KiB)
  0.032980 seconds (18.21 k allocations: 1.812 MiB, 29.23% gc time, 95.21% compilation time)

So the threaded loop is roughly 5 times slower than the sequential loop (not counting for GC time) after the first execution. I am aware that for such a trivial example, the computational cost of managing threads is much higher than simply calculating the loop, therefore explaining the loss in speed. In my RL-application, the calculation is sufficiently expensive for multithreading to make sense. However, two things really stood out for me:

  1. The high number of allocations (from my understanding, due to the thread spawning?)
  2. Every time the mainthreaded() function is invoked, most of the time is spent as compilation time?

In my real-world application, the Gtk package is used (not in the loop, just in general). Profiling reports that almost all time is spent in gtk_main(), which seems to be the same issue as reported here:

However, even when completely removing Gtk from the project, the issue simply shifts to threading-setup functions.

One suggestion I found was using the ThreadPools package, see here:

However, this didn’t help either.

Therefore, I am wondering whether an individual thread is spawned for each value k and the loop body is recompiled for each single iteration. Could this be true? This could also explain the multitude of threading-related issues here on Discourse, e.g.:

Any insights would be appreciated very much!

When you say “running the script 4 times in a row”, do you mean you are restarting julia inbetween runs? That would explain why you see these compilation times, as the compiled code isn’t cached.

When I run the code twice in the same session, I get the following:

julia> @time main()
  0.003548 seconds (2 allocations: 781.328 KiB)

julia> @time mainthreaded()
  0.037268 seconds (47.78 k allocations: 3.629 MiB, 47.28% compilation time)

julia> @time main()
  0.003816 seconds (2 allocations: 781.328 KiB)

julia> @time mainthreaded()
  0.001187 seconds (127 allocations: 793.719 KiB)

So the threaded code is, in fact, faster (this is a 12 core machine).
Cheers
Simon

Thank you for your fast answer! Well, this is embarrassing… I run the entire script four times, but of course this redefined the functions, therefore a recompilation was necessary each time. Doing it properly solves the issue. It doesn’t solve my RL-application issue, but I think I have to come up with a new MWE for that :smiley:

The real issue is that the function without multithreading is precompiled when you run the script whereas the threaded function is not. That is why you don’t see the compilation time of main() every time you re-run the script (even though, as you said, you are redefining the function). I tried adding precompile statements to no avail.

Someone more knowledgeable than me should pitch in as to why this is the case :slight_smile: