I have an 8 core machine. When I run a certain piece of code with Python’s multiprocessing.Pool library, with 1 core vs. 4 cores, I see an almost 4x speedup for the 4 core case, with very little overhead penalty, assuming the number of iterations is large enough. Specifically, using 4 cores instead of 1 turns a 62 second computation into a 17 second computation: only about a 10% overhead (17/(62/4) = ~1.1).
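For reference, the Python benchmark is essentially the following (a minimal sketch; `run_trial` here is a stand-in for my actual per-trial computation, which calls into C++, and the worker count and iteration count are illustrative):

```python
import time
from multiprocessing import Pool

def run_trial(_):
    # Stand-in for the real per-trial work; Pool.map passes an index,
    # which the real run_trial ignores.
    total = 0
    for i in range(200_000):
        total += i * i
    return total

if __name__ == "__main__":
    N = 400
    for workers in (1, 4):
        start = time.perf_counter()
        with Pool(processes=workers) as pool:
            pool.map(run_trial, range(N))
        elapsed = time.perf_counter() - start
        print(f"{workers} worker(s): {elapsed:.2f} s")
```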
However, with very similar Julia code I am seeing anywhere from 50% to a whopping 150% overhead. Unfortunately, the minimal reproducible examples I’ve come up with are on the less dramatic end (15-21% overhead), so I won’t post those; instead I’ll ask about incomplete code snippets in the context of my larger codebase, and what errors I might be making, having ruled out some common ones.
There are no global variables, no parameters are passed to the run_trial() function below, and that function is called inside another function, so the loop does not run at global scope.
What common errors could be giving me a 50% overhead in the case of:
using Distributed
addprocs(4)

@time @sync @distributed for i in 1:N
    run_trial()
end
(I define run_trial with @everywhere.)
Or a 150% overhead in the case of:
@time Threads.@threads for i in 1:N
    run_trial()
end
(where Threads.nthreads() returns 4.)
Specifically, these take about 40 seconds for the single-core/thread version, about 15 seconds for the @distributed version, and 20-25 seconds for the Threads.@threads version. I’ve tried upping the number of iterations, but the time proportions are about the same, ruling out a “one-time cost of spinning up threads/processes” situation.
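(For what it’s worth, the way I check for a one-time startup cost on the Python side is to time the same parallel loop twice within one session; if the first run is no slower than the second, spin-up cost isn’t the issue. A sketch, with a hypothetical `work` function standing in for the real computation:)

```python
import time
from multiprocessing import Pool

def work(_):
    # Hypothetical stand-in for the per-trial computation.
    return sum(i * i for i in range(100_000))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        for attempt in (1, 2):
            start = time.perf_counter()
            pool.map(work, range(200))
            print(f"run {attempt}: {time.perf_counter() - start:.2f} s")
```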
Given the things I have ruled out above, what else could be causing this kind of slowdown/overhead? Is there anything obviously wrong with my (incomplete) code snippets above? Or is Julia’s parallelization code simply slower than Python’s for now?
(I’m using Python to launch C++ code, hence the similarity in single-threaded speed for the two implementations of the computation.)