Why might I be seeing a large overhead for multiprocessing?

I have an 8-core machine. When I run a certain piece of code with Python’s multiprocessing.Pool library, with 1 core vs. 4 cores, I see an almost 4x speedup in the 4-core case, with very little overhead penalty, assuming the number of iterations is large enough. Specifically, using 4 cores instead of 1 turns a 62-second computation into a 17-second computation: only about 10% overhead (17/(62/4) ≈ 1.1).

However, with very similar Julia code I am seeing anywhere from a 50% to a whopping 150% overhead. Unfortunately, the minimal reproducible examples I’ve come up with are on the less dramatic end (15-21% overhead), so I won’t post those. Instead I’ll ask about incomplete code snippets in the context of my larger codebase, and what errors I might be making, after ruling out some common ones.

There are no global variables, no parameters are passed to the run_trial() function below, and the loop itself lives inside another function, so it does not run at global scope.

What common errors could be giving me a 50% overhead in the case of:

using Distributed
addprocs(4)

@time @sync @distributed for i in 1:N
    run_trial()
end

(I define run_trial with @everywhere).
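
For concreteness, the overall structure looks roughly like this (a sketch only: the body of run_trial here is a stand-in for my real code, and run_all is a made-up name for the enclosing function):

using Distributed
addprocs(4)

@everywhere function run_trial()
    # stand-in body; the real work lives in the larger codebase
    return sum(sin, 1:10_000)
end

# hypothetical enclosing function, so the loop does not run at global scope
function run_all(N)
    @time @sync @distributed for i in 1:N
        run_trial()
    end
end

run_all(1000)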

Or a 150% overhead in the case of:

@time Threads.@threads for i in 1:N
    run_trial()
end

(where Threads.nthreads() returns 4.)

Specifically, these take about 40 seconds for the single-core/thread version, about 15 seconds for the @distributed version, and 20-25 seconds for the Threads.@threads version. I’ve tried upping the number of iterations, but the time proportions are about the same, ruling out a “one-time cost of spinning up threads/processes” situation.

Given the things I have ruled out above, what else could be causing this kind of slowdown/overhead? Or is there anything obviously wrong with my (incomplete) code snippets above? Or is Julia’s parallelization code simply slower than Python’s for now?

(I’m using Python to launch C++ code, hence the similar single-threaded speeds for these similar computations.)

Some of this could be compilation time, which won’t change with threads.

Isn’t this possibility ruled out by what I said above?

I’ve tried upping the number of iterations, but the time proportions are about the same, ruling out a “one-time cost of spinning up threads/processes” situation.

For example, I’ve tried increasing the number of iterations in the outer loop from 1000 to 5000, but the overhead proportion remains the same: where it was 40 seconds single-threaded and 15 seconds distributed (no overhead would be 10 seconds), it becomes 200 seconds single-threaded and 75 seconds distributed (no overhead would be 50 seconds).

Doesn’t this rule out compilation as the primary culprit? If compilation were the problem, I would expect the “overhead” proportion of time to decrease towards zero as the number of iterations becomes very large. Or does compilation occur again for each new thread or process in the loop?
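
One way I could check this directly would be to call the enclosing function (run_all in the sketch above) once as a warm-up before the measured run:

# warm-up: compiles run_trial and the parallel loop machinery
run_all(1)

# a second, full-sized run should then exclude compilation from its timing
run_all(5000)

If the overhead proportion in the second run is unchanged, compilation would seem to be ruled out.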


Are there (a lot of) memory allocations? Having the GC run in multi-threaded code can cause a lot of overhead.
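
A quick way to check, just as a sketch, is to look at the allocations of a single call (the numbers depend entirely on what run_trial does):

@time run_trial()        # the "(k allocations: n bytes)" part of the printout is what matters here
@allocated run_trial()   # bytes allocated by one call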

A minimal working example would be important, because I have seen different things affect parallelization efficiency.

One example is the use of rand() within the code: rand() accesses shared global state (the default RNG) and can cause problems.
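
A common workaround, sketched below (the names and seeds are only illustrative), is to give each thread its own RNG and pass it explicitly, instead of relying on the shared default:

using Random

# one independent RNG per thread, so threads do not contend on shared state
const thread_rngs = [MersenneTwister(1000 + i) for i in 1:Threads.nthreads()]

function run_trial_rng()
    rng = thread_rngs[Threads.threadid()]
    s = 0.0
    for i in 1:10_000
        s += rand(rng)
    end
    return s
end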


I am not sure whether this is representative of your problem, but it is a problem in itself, with the same structure. I am not sure whether I should open a new thread; please let me know.

The observation agrees with your case: the overhead is, first, too large for such a simple parallelization and, second, almost independent of the total running cost.

using BenchmarkTools

function run_trial()
    s = 0.0
    for i in 1:10_000
        s += sin(i)
    end
    return s
end

function serial(n)
    for i in 1:n
        run_trial()
    end
end

function parallel(n)
    Threads.@threads for i in 1:n
        run_trial()
    end
end


nsamples = [10, 100, 1_000, 10_000]

ts = zeros(length(nsamples))
for (i, n) in enumerate(nsamples)
    ts[i] = @belapsed serial($n)
end

tp = zeros(length(nsamples))
for (i, n) in enumerate(nsamples)
    tp[i] = @belapsed parallel($n)
end

println("nthreads = ", Threads.nthreads())
println("Ratio serial/parallel:")
@. ts / tp



Result:

nthreads = 4
Ratio serial/parallel:
4-element Array{Float64,1}:
 2.8418701520867993
 3.362146655495375
 3.2891125153810337
 3.0830637040329347

(I slightly edited the example to guarantee that the interpolation was OK, and changed the way I was measuring the time in view of the comments below.)

Why have you used b.times[end] instead of time(b)? The former gives you the maximum time, which is not a good measure of the benchmark. time(b), on the other hand, gives you the minimum time.
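
For reference, a minimal sketch of the difference, assuming BenchmarkTools is loaded and serial is the function from your example:

b = @benchmark serial(100)

minimum(b).time   # fastest sample in ns: the usual summary statistic
maximum(b).time   # slowest sample in ns: what b.times[end] reports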

Even simpler, you can use @belapsed to get the result directly:

ts[i] = @belapsed serial($n)

On my machine this gives:

nthreads = 4
Ratio serial/parallel:
4-element Array{Float64,1}:
 3.1883971880492092
 3.8477338773872716
 3.761169721425446
 3.7011251309703326

Hyperthreading must also be taken into account. My machine has 4 cores and 8 threads:

nthreads = 8
Ratio serial/parallel:
4-element Array{Float64,1}:
 3.1577821220172875
 4.917520448435623
 4.338376012807572
 4.278896907248903

I thought that was the total time, sorry.

Yet, using time(b), which I understand is the minimum time, I get:

julia> include("./threads.jl")
nthreads = 4
Ratio serial/parallel:
4-element Array{Float64,1}:
 2.966150543190738
 3.5521859165484257
 3.4399227221995456
 2.915680593319755

and using @belapsed, I get:

julia> include("./threads.jl")
nthreads = 4
Ratio serial/parallel:
4-element Array{Float64,1}:
 2.8418701520867993
 3.362146655495375
 3.2891125153810337
 3.0830637040329347


or, using hyperthreading:

julia> include("./threads.jl")
nthreads = 8
Ratio serial/parallel:
4-element Array{Float64,1}:
 2.606308284830257
 3.957226663063897
 3.823176173087636
 3.3907687991020468

My laptop has 4 physical cores (8 threads with hyperthreading, like yours).

I don’t know; it seems that for such simple experiments the results are not as good as one would expect.

EDIT: I tested similar code in Fortran with OpenMP, and the results are no better than these.

(I updated the example above to use the elapsed time)

A minimal working example would be important, because I have seen different things affect parallelization efficiency.

If I had this, I think I would have an answer! Like I said, the minimal examples I tried to construct gave far less dramatic differences than my code. Unfortunately, the codebase is too large and complex to just start pulling code out until the overhead goes away.

One example is the use of rand() within the code: rand() accesses shared global state (the default RNG) and can cause problems.

Thank you for the tip; I’m a bit surprised this wasn’t covered in the manual (unless I missed it). I modified all the relevant code, and this turned out not to make a big difference in my case for threading. For multiprocessing (the @distributed macro), this should not be a problem or have an effect, correct?

Any other tips/“gotchas” you have similar to the rand() one above would be greatly appreciated. Thank you again.