# Threaded loop far slower than sequential loop (+ high compilation time)

Hello,

When trying to parallelize a loop, performance got absolutely thrashed - by roughly a factor of 1000. I tried to reproduce the problem with an MWE and found something interesting:

```julia
main() = loop(Vector{Float64}(undef, 100_000))
function loop(arr)
    for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float)*cos(k_float)*tan(k_float)/sqrt(k_float) # Just some calculation to sink time into
    end
end

mainthreaded() = loopthreaded(Vector{Float64}(undef, 100_000))
function loopthreaded(arr)
    Threads.@threads for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float)*cos(k_float)*tan(k_float)/sqrt(k_float) # Just some calculation to sink time into
    end
end

println("===============")
@time main()
@time mainthreaded()
```

When running this script four times in a row, I get the following REPL output:

```
0.004578 seconds (2 allocations: 781.328 KiB)
0.041883 seconds (47.42 k allocations: 3.599 MiB, 95.31% compilation time)
===============
0.004192 seconds (2 allocations: 781.328 KiB)
0.022073 seconds (18.21 k allocations: 1.812 MiB, 91.24% compilation time)
===============
0.004347 seconds (2 allocations: 781.328 KiB)
0.023520 seconds (18.21 k allocations: 1.812 MiB, 91.38% compilation time)
===============
0.005623 seconds (2 allocations: 781.328 KiB)
0.032980 seconds (18.21 k allocations: 1.812 MiB, 29.23% gc time, 95.21% compilation time)
```

So the threaded loop is roughly 5 times slower than the sequential loop (not counting GC time), even after the first execution. I am aware that for such a trivial example the cost of managing threads is much higher than the cost of the loop body itself, which explains the loss in speed. In my RL application, the calculation is expensive enough for multithreading to make sense. However, two things really stood out to me:

1. The high number of allocations (from my understanding, due to thread spawning?)
2. Every time the `mainthreaded()` function is invoked, most of the time is reported as compilation time - why?
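As a sanity check on the overhead argument, making each iteration artificially more expensive should let the threaded version pull ahead. A sketch (the `expensive` kernel and its repetition count are made up for illustration):

```julia
# Sketch: same pattern as the MWE, but with artificially expensive
# iterations, so task-spawning overhead no longer dominates.
function expensive(k_float)
    acc = 0.0
    for _ in 1:1_000  # arbitrary repetition count, just to burn time
        acc += sin(k_float) * cos(k_float) / sqrt(k_float + 1.0)
    end
    return acc
end

function loop_expensive(arr)
    for k in 1:length(arr)
        arr[k] = expensive(float(k))
    end
end

function loopthreaded_expensive(arr)
    Threads.@threads for k in 1:length(arr)
        arr[k] = expensive(float(k))
    end
end

arr = Vector{Float64}(undef, 10_000)
loop_expensive(arr)             # warm up both functions first,
loopthreaded_expensive(arr)     # so @time excludes compilation
@time loop_expensive(arr)
@time loopthreaded_expensive(arr)
```

Started with e.g. `julia -t 4`, the second `@time` should drop below the first once the per-iteration cost dominates the threading overhead.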

In my real-world application, the Gtk package is used (not in the loop, just in general). Profiling reports that almost all time is spent in `gtk_main()`, which seems to be the same issue as reported here:

However, even when completely removing Gtk from the project, the issue simply shifts to threading-setup functions.

One suggestion I found was using the `ThreadPools` package, see here:

However, this didn’t help either.

Therefore, I am wondering whether an individual thread is spawned for each value of `k` and the loop body is recompiled for every single iteration. Could this be true? It could also explain the multitude of threading-related issues here on Discourse, e.g.:

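For what it's worth, one way to check the one-task-per-iteration hypothesis is to count how many iterations each thread ends up handling - a sketch (assuming Julia ≥ 1.8 for the `:static` schedule option, which pins iterations to threads):

```julia
# Sketch: tally iterations per thread. @threads splits the range into
# Threads.nthreads() contiguous chunks rather than one task per iteration.
counts = zeros(Int, Threads.nthreads())
Threads.@threads :static for k in 1:1_000
    counts[Threads.threadid()] += 1
end
@show counts  # e.g. with 4 threads, four chunks of 250 - not 1000 tasks
```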
Any insights would be appreciated very much!

When you say “running the script 4 times in a row”, do you mean you are restarting Julia in between runs? That would explain the compilation times you see, as the compiled code isn’t cached across sessions.

When I run the code twice in the same session, I get the following:

```
julia> @time main()
  0.003548 seconds (2 allocations: 781.328 KiB)

julia> @time mainthreaded()
  0.037268 seconds (47.78 k allocations: 3.629 MiB, 47.28% compilation time)

julia> @time main()
  0.003816 seconds (2 allocations: 781.328 KiB)
```

The real issue is that the function without multithreading is precompiled when you run the script, whereas the threaded function is not. That is why you don’t see the compilation time of `main()` every time you re-run the script (even though, as you said, you are redefining the function). I tried adding `precompile` statements, to no avail.
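As a workaround for measurement purposes (not a fix for the caching itself), you can warm the threaded function up with a throwaway call before timing it. A sketch, with simplified stand-ins for the functions above:

```julia
# Sketch: minimal stand-ins for the thread's loopthreaded/mainthreaded.
function loopthreaded(arr)
    Threads.@threads for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float) * cos(k_float)
    end
end
mainthreaded() = loopthreaded(Vector{Float64}(undef, 100_000))

mainthreaded()          # warm-up call pays the one-time compile cost
@time mainthreaded()    # now measures execution only
```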