I have some trivially parallelizable for loops that I wanted to speed up using multi-threading, namely with Folds.jl's Folds.foreach(). I started Julia with julia -t 8 to get 8 worker threads and observed that profiling the code makes it faster:
# Without any profiling
julia> @time expect(args...; n_chunks=1)
8.118154 seconds (5.13 k allocations: 50.581 MiB, 0.21% gc time)
julia> @time expect(args...; n_chunks=8)
16.847068 seconds (64.31 k allocations: 55.753 MiB, 0.05% gc time)
# Profiling with @profile
julia> @time @profile expect(args...; n_chunks=1)
9.119578 seconds (5.13 k allocations: 50.581 MiB, 0.13% gc time)
julia> @time @profile expect(args...; n_chunks=8)
4.647983 seconds (64.16 k allocations: 55.751 MiB, 1.67% gc time)
# Profiling with @profview
julia> @time @profview expect(args...; n_chunks=1)
9.244527 seconds (87.70 k allocations: 79.798 MiB, 0.33% gc time, 0.01% compilation time)
Gtk4.GtkWindowLeaf(accessible-role=GTK_ACCESSIBLE_ROLE_WINDOW, ...)
julia> @time @profview expect(args...; n_chunks=8)
2.939202 seconds (266.93 k allocations: 86.765 MiB, 3.42% gc time, 0.02% compilation time)
Gtk4.GtkWindowLeaf(accessible-role=GTK_ACCESSIBLE_ROLE_WINDOW, ...)
Furthermore, the n_chunks parameter controls how many parallelizable batches the data is split into. Without any profiling, increasing the number of batches slows the code down, even though it should in principle enable more threads to work in parallel. With profiling, on the other hand, increasing the number of batches does speed up the computation (although not as close to linearly as I had hoped).
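For context, the call pattern looks roughly like this. This is a minimal sketch, not my actual code base; expect_sketch, process_chunk!, expensive_work, and data are placeholder names:

using Folds

expensive_work(x) = sum(sin, x:0.001:x + 1)   # stand-in for the real per-element work

process_chunk!(chunk) = foreach(expensive_work, chunk)

function expect_sketch(data; n_chunks=1)
    chunk_size = cld(length(data), n_chunks)                # elements per batch
    chunks = collect(Iterators.partition(data, chunk_size))
    Folds.foreach(process_chunk!, chunks)                   # batches may run on different threads
end

expect_sketch(1:1_000; n_chunks=8)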
Also, when watching htop while running the above code, I can see that all 8 CPU cores spike in usage when the code is profiled, while without profiling they don't. Overall, I am under the impression that the code isn't actually running multi-threaded when it is not being profiled.
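To double-check this impression beyond htop, one can log which thread picks up each batch. A quick sketch of such a probe (not output from my actual session):

using Folds

@show Threads.nthreads()   # should report 8 for julia -t 8

Folds.foreach(1:8) do i
    # If every line reports the same thread id, the work is not being distributed.
    println("chunk ", i, " handled by thread ", Threads.threadid())
end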
Has anyone had similar experiences before and would know what could cause multi-threaded code to run faster when profiled?
Unfortunately, I can't provide an actual MWE because it is a pretty big code base, but any pointers on what to look out for would be very welcome.
Other things I have tried
Using a manual Threads.@threads loop instead of Folds.foreach() ended up being slightly slower in all cases, and the same Heisenbug persisted.
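The manual variant looked roughly like this (same placeholder names as in the sketch above):

function expect_threads_sketch(data; n_chunks=1)
    chunk_size = cld(length(data), n_chunks)
    chunks = collect(Iterators.partition(data, chunk_size))
    Threads.@threads for chunk in chunks    # one task per batch, scheduled by Julia's threading runtime
        process_chunk!(chunk)
    end
end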
I also replaced Folds.foreach with a plain foreach as a sanity check and got:
julia> @time expect(args...; n_chunks=1)
8.003162 seconds (5.13 k allocations: 50.581 MiB, 0.24% gc time)
julia> @time expect(args...; n_chunks=8)
8.873845 seconds (6.22 k allocations: 51.221 MiB, 0.11% gc time)
julia> @time @profview expect(args...; n_chunks=8)
10.364693 seconds (86.40 k allocations: 87.422 MiB, 0.21% gc time, 0.01% compilation time)
which is as expected: increasing the number of batches makes the code marginally slower, because the batches are now processed one after another rather than in parallel, and profiling adds a bit of overhead of its own on top.
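For reference, this sanity check is the same sketch as above with the parallel foreach swapped for Base's serial one:

function expect_serial_sketch(data; n_chunks=1)
    chunk_size = cld(length(data), n_chunks)
    chunks = collect(Iterators.partition(data, chunk_size))
    foreach(process_chunk!, chunks)   # batches run one after another on a single thread
end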