Why is the parallel map so slow?

I have an array a defined as:
julia> a = collect(1:10000)
I can map a squaring function over the array and time the execution as following:
julia> @time map((x)->x^2, a)
On my machine, the line above runs in 0.05 seconds.

I then try to see the performance gain by running processes in parallel:

julia> using Distributed

julia> addprocs(4)

julia> @time pmap((x)->x^2, a)

and I find that pmap runs in 0.5 seconds, which makes it 10 times slower than map. The documentation says that

pmap is designed for the case where each function call does a large amount of work. In contrast, @distributed for can handle situations where each iteration is tiny, perhaps merely summing two numbers.

I assume that this is the reason for the slow down since my function call is only squaring a number. However, what is the deeper explanation for this? Why does pmap face difficulty in situations where the function call does little work? And more importantly: how to efficiently parallelize the procedure of mapping a small function onto a large array (as in the example above)?

@distributed partitions the iteration space into a piece per worker, and then those workers handle many pieces.
pmap hands off the pieces one at a time.

Both approaches have pros and cons. Every bit of work delivered to a worker has overhead. That overhead is much higher than a simple operation like squaring a number. But, if the worker is given the task of squaring a huge amount of numbers (like would be the case with @distributed), that cost could be amortized over that huge amount.
However, lets say you have very few tasks to do, they each take a while (doesn’t have to be long for the overhead to be negligible!), but vary in just how long. Perhaps on top of that variation, the number of jobs to be done doesn’t divide well by the number of workers, e.g. 10 jobs and 8 workers.
If you hand these off to workers in advance, maybe the 4 slowest tasks end up with workers 1 and 2, while workers 5 and 8 receive relatively short tasks. To workers 5 and 8 finish soon, and then have nothing to do and sit idle. While workers 1 and 2 chug along for a long time on their first tasks before finally finishing them, after which they begin chugging along on their second task. What we would have wanted is 5 and 8 to be handed incomplete tasks as soon as they were done.
That is what pmap does. Whenever a worker completes something, it’ll be handed another batch of assignments if any are left, thus balancing the workload among computer’s resources.

What if you want the best of both worlds for your particular task? E.g., maybe you want workers to receive batches of 400 at a time. That is what the batch_size argument is for:

help?> pmap
search: pmap promote_shape typemax PermutedDimsArray process_messages

  pmap(f, [::AbstractWorkerPool], c...; distributed=true, batch_size=1, on_error=nothing, retry_delays=[], retry_check=nothing) -> collection

In this way, you could tune pmap to have the behavior you want for a particular problem.
Of course, for something as cheap and consistent as squaring, I think just using @distributed (or batch_size = cld(length(iterable), nworkers())) is ideal.

9 Likes

Thanks, that was quite helpful :slightly_smiling_face: