Hello,

When I tried to parallelize a loop, performance dropped drastically, by roughly a factor of 1000. I tried to reproduce the problem with an MWE and found something interesting:

```
main() = loop(Vector{Float64}(undef, 100_000))

function loop(arr)
    for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float)*cos(k_float)*tan(k_float)/sqrt(k_float) # Just some calculation to sink time into
    end
end

mainthreaded() = loopthreaded(Vector{Float64}(undef, 100_000))

function loopthreaded(arr)
    Threads.@threads for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float)*cos(k_float)*tan(k_float)/sqrt(k_float) # Just some calculation to sink time into
    end
end

println("===============")
@time main()
@time mainthreaded()
```

When running this script four times in a row, I get the following REPL output:

```
0.004578 seconds (2 allocations: 781.328 KiB)
0.041883 seconds (47.42 k allocations: 3.599 MiB, 95.31% compilation time)
===============
0.004192 seconds (2 allocations: 781.328 KiB)
0.022073 seconds (18.21 k allocations: 1.812 MiB, 91.24% compilation time)
===============
0.004347 seconds (2 allocations: 781.328 KiB)
0.023520 seconds (18.21 k allocations: 1.812 MiB, 91.38% compilation time)
===============
0.005623 seconds (2 allocations: 781.328 KiB)
0.032980 seconds (18.21 k allocations: 1.812 MiB, 29.23% gc time, 95.21% compilation time)
```

So after the first execution, the threaded loop is roughly 5 times slower than the sequential loop (not accounting for GC time). I am aware that for such a trivial example the overhead of managing threads is much higher than the cost of the loop itself, which explains the slowdown. In my real-world application, the calculation is sufficiently expensive for multithreading to make sense. However, two things really stood out to me:

- The high number of allocations (from my understanding, due to the thread spawning?)
- Every time the `mainthreaded()` function is invoked, most of the time is spent on compilation?
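
To check whether the compilation share persists, one option is to define the function once and time several calls within the same session instead of re-running the whole script. The following is only a minimal sketch under that assumption; `loopthreaded_demo` and `time_calls` are hypothetical names introduced for illustration, and only `Base` and `Base.Threads` are used.

```julia
# Hypothetical names for illustration; same loop body as in the MWE.
function loopthreaded_demo(arr)
    Threads.@threads for k in eachindex(arr)
        k_float = float(k)
        arr[k] = sin(k_float) * cos(k_float) * tan(k_float) / sqrt(k_float)
    end
    return arr
end

# Time n successive calls of f(arr) in the same session.
# The first entry may include compilation; later entries are the
# ones to compare against the sequential loop.
function time_calls(f, arr, n)
    return [@elapsed(f(arr)) for _ in 1:n]
end

arr = Vector{Float64}(undef, 100_000)
timings = time_calls(loopthreaded_demo, arr, 5)

println("threads available: ", Threads.nthreads())
println("first call:  ", timings[1], " s")
println("later calls: ", timings[2:end], " s")
```

If the later calls are much faster than the first one, the cost is paid once per definition rather than once per call. Printing `Threads.nthreads()` also confirms whether Julia was started with more than one thread at all.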

In my real-world application, the Gtk package is used (not in the loop, just in general). Profiling reports that almost all time is spent in `gtk_main()`, which seems to be the same issue as reported here:

However, even when completely removing Gtk from the project, the issue simply shifts to threading-setup functions.

One suggestion I found was to use the `ThreadPools` package, see here:

However, this didn’t help either.

Therefore, I am wondering whether an individual thread is spawned for each value `k` and the loop body is recompiled for every single iteration. Could this be true? This could also explain the multitude of threading-related issues here on Discourse, e.g.:

Any insights would be appreciated very much!
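
P.S. The "one thread per value of `k`" hypothesis could be tested with a small diagnostic like the one below. It is a sketch, not a definitive method: `record_workers` is a hypothetical helper, and it simply records which thread and which task executed each iteration so the number of distinct workers can be compared with the number of iterations.

```julia
# Hypothetical diagnostic helper: note the executing thread and task
# for every iteration of a threaded loop.
function record_workers(n)
    thread_ids = Vector{Int}(undef, n)
    task_ids = Vector{UInt}(undef, n)
    Threads.@threads for k in 1:n
        thread_ids[k] = Threads.threadid()
        # objectid distinguishes the Task objects that run the loop chunks
        task_ids[k] = objectid(current_task())
    end
    return thread_ids, task_ids
end

thread_ids, task_ids = record_workers(100_000)

println("iterations:          ", length(thread_ids))
println("distinct thread ids: ", length(unique(thread_ids)))
println("distinct tasks:      ", length(unique(task_ids)))
println("Threads.nthreads():  ", Threads.nthreads())
```

If the number of distinct tasks is far below the number of iterations, the loop is being split into chunks rather than spawning one unit of work per `k`.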