Poor speed gain using `pmap`

Hi, I am using the pmap function for parallelization. However, the speedup is not satisfactory. The speed gain is very far from linear.

As an illustration, I count the time used when each simulation draws 10,000,000 random numbers 2,000 times. This is done by the following code:

using Distributed
using BenchmarkTools
using DataFrames
using PyPlot
using Plots

# Parameters
N = 100
m = 2000
df = DataFrame(time = Float64[], proc = Int[])

# Run simulations
for i = 1:4
    addprocs(i) # Number of procs used
    for j = 1:N
        print
        @everywhere function func(x, n = 10000000)
            y = rand(n)
            return sum(y)
        end
        tmp = @elapsed pmap(func, 1:m) # Get the time used
        push!(df, (time = tmp, proc = i)) # Append the result
    end
    rmprocs(workers()) # Reset the number of procs
end

The following are the average time used for each procs:

  • 1 proc: 41.720s
  • 2 procs: 25.963s
  • 3 procs: 23.241s
  • 4 procs: 22.315s

As seen above, I get around 1.61x speedup when I use 2 procs. The speedup with more procs is minimal.

Therefore, I am wondering if I have not used pmap the most efficient way, or if there is a better way to run the code in parallel.

Any ideas are appreciated. Thanks!

this looks memory bottle-necked to me

Thanks @jling. I am only using 4 procs although length(Sys.cpu_info()) returns 8. My original thought was that using < 8 procs is not going to give memory issues.

Do you think it is probably because there is insufficient RAM on my computer?

What’s your versioninfo()? How much memory does your machine have? The memory you’re requesting is 80 MB (sizeof(Float64) * 10_000_000) per function call.

You’re also defining the function func a total of 4*100 = 400 times, which I don’t think is what you want.

not the total size of it, just the fact that you have 4 processes allocating a shit ton of memory just to do a very cheap sum

1 Like

Regardless of the amount of RAM used, most consumer CPUs only have two memory channels - adding more than two sticks of RAM will have at best a minor impact. Unless your processor is on this list, you’re probably better off trying to limit the amount of memory traffic (maybe by accumulating your sum in-place rather than storing that big vector in memory).

2 Likes

Hi @Sukera:
My versioninfo():

ulia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 

My physical memory: 16 GB.

I am running each process 100 times to get the avg. time used.
Thanks!

Ok, but do you actually want to include function compilation in there as well? As it is now, I think you’re redefining func for each measurement, causing recompilation over and over again.

Other than that, allocating so many 80 MB vectors in short succession is bound to put some pressure on the garbage collector. Can you try with return sum(rand() for _ in 1:n) or an explicit loop instead (just to check whether GC has anything to do with it)?

Thanks both @jling and @stillyslalom!

As a general coding practice, do you recommend that I should always accumulate the sum rather than storing a big vector?

Just checked my processor and it’s not on the list. In this case, does it mean that using addprocs(2) is not going to make any big difference with addprocs(x) for any x > 2?

Thanks @Sukera!

The reason that I am defining func in this way is that if I define func outside all the for-loops in my example above, there is an error after I call addprocs(i).
But I agree that a better way to do the above is to define func after addprocs(i) rather than after the j-for-loop. Would this be what you suggest?

I just reran the code with your suggestions, i.e., the code now becomes:

for i = 1:4
    addprocs(i) # Number of procs used
    @everywhere function func(x, n = 10000000)
        return sum(rand() for _ in 1:n)
    end
    for j = 1:N
        tmp = @elapsed pmap(func, 1:m) # Get the time used
        push!(df, (time = tmp, proc = i)) # Append the result
    end
    rmprocs(workers()) # Reset the number of procs
end

The average time is as follows:

  • 1 proc: 68.903s
  • 2 procs: 36.546s (1.885x)
  • 3 procs: 22.901s (3.009x)
  • 4 procs: 19.129s (3.602x)

It looks like that I am having better speed gains now, despite using 4 procs does not give a linear speedup as using 3 procs.

Also, comparing the time used in this new set-up with the original set-up, the new version of the code takes longer if I use only 1 or 2 procs. For instance, it takes 41.72s to do the task in the original example, but it now takes 68.903s. Do you think using an explicit loop might be better than using return sum(rand() for _ in 1:n) ?

Thanks!

You’ll only see diminishing returns for memory-bounded problems where the array doesn’t fit in the processor’s cache and the computation per array element is minimal.

The decision on whether or not to allocate a vector is sensitive to the problem at hand - in particular, to Julia’s ability to translate the intermediate computation to vectorized instructions, which is easier for the compiler when given a vector of concrete type and size. Sometimes, the speed gain from successful vectorization can outweigh the time cost of memory allocation. For example,

julia> @btime sum(_ -> rand(), 1:10^7)
  49.681 ms (0 allocations: 0 bytes)
4.999833123951193e6

julia> @btime sum(rand(10^7))
  21.230 ms (2 allocations: 76.29 MiB)
4.999094016938199e6

However, when we use threaded operations from ThreadsX.jl, the advantage of the non-allocating version becomes clear, since the random number generator is single-threaded.

julia> @btime ThreadsX.sum(_ -> rand(), 1:10^7)
  5.353 ms (1034 allocations: 71.91 KiB)
4.998537194030924e6

julia> @btime ThreadsX.sum(rand(10^7))
  19.819 ms (1029 allocations: 76.37 MiB)
4.99976135990007e6

You could also try multithreaded random number generation, but I don’t think there’s an easy way to do that at the moment.

1 Like

Thanks! Would you recommend using threaded operations against other parallelization methods (like pmap above)?

In addition, in case I come across a situation that ThreadsX is not supporting the function I need, do you think coding it up with @thread is a good way to obtain speed gain?

It depends on the application - the reason there are so many different parallelism tools/approaches isn’t because we all have more than enough time on our hands to create and understand them, but because different algorithms demand different approaches to parallelization. Generally speaking, if your input can be cleanly split into independent parts with minimal inter-process communication, pmap should be fine. If you need all your processes to have lightweight, quasi-simultaneous read/write access to the same chunk of memory (or you don’t want to deal with the overhead of setting up worker processes), multithreading may be a better option.

As is often the case, it’s hard to recommend a particular approach without seeing the code. Can you share a snippet?

3 Likes

Thanks! If I use multithreading, would the benefit of specifying more than two threads be limited as per your earlier comment on memory channels of most consumer CPUs?

If you’re limited by memory bandwidth, there’ll be minimal speedup from additional CPU cores, but there’s also no real downside to just throwing all available cores at the problem, and some potential upside if part of your workload is CPU-limited.

2 Likes

@Sukera Just to provide more details, I run both my original example and the method that you have suggested again. It looks like the speed gain of adding more procs for return sum(rand() for _ in 1:n) is better, but the overall time used is longer than my original example above.

Do you think it has anything to do with the GC?

For your information, the following plot is distributed of time obtained using my original example:

The following plot is the distribution of time used for return sum(rand() for _ in 1:n):

Thanks!

Hi @stillyslalom, sorry for returning back to this earlier post. I try to run the last example in your post again but I am getting the opposite result, where ThreadsX.sum(rand(10^7)) is actually using comparatively less time. I tried a couple of times and retstarting my REPL too. But I am unable to obtain the result.

May I know that did you make some other settings before running the code? Thanks!

julia> @btime ThreadsX.sum(_ -> rand(), 1:10^7)
  27.494 ms (56 allocations: 2.97 KiB)
5.000145773621169e6

julia> @btime ThreadsX.sum(rand(10^7))
  14.773 ms (52 allocations: 76.30 MiB)
5.000568360918457e6

Yes - after all, you’re not changing the function and you’re only repeatedly recompiling it.

I’m not sure how far specialized sum on generators is, but my bet would be that the generator version can’t SIMD, while the explicit loop or reduction over an array can.

That’s hard to say without knowing how much the code allocates for you, but my guess would be that it’s not GC that’s tanking performance here, but rather not being able to SIMD (like I said above).

I’m not surprised by that - the threaded version can probably just split the array and use SIMD to accumulate each individual threaded result before combining it all (on top of that, the array is already generated before the threaded execution even begins), while for the generator version the 4 threads can probably only pick one evaluation of your anonymous function at a time, having to generate the random number themselves and only then get to reduce with +.