Why is the parallel map so slow?

@distributed partitions the iteration space up front into one contiguous chunk per worker, so each worker handles exactly one big piece.
pmap instead hands out the pieces one at a time, as workers become free.

Both approaches have pros and cons. Every unit of work delivered to a worker has overhead, and that overhead is much higher than a simple operation like squaring a number. But if a worker is given the task of squaring a huge number of values (as would be the case with @distributed), that cost is amortized over all of them.
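As a rough sketch of that amortization (the worker count and range size here are arbitrary, just for illustration):

```julia
using Distributed
addprocs(4)  # arbitrary worker count for illustration

# @distributed splits 1:10_000 into one contiguous chunk per worker,
# so the scheduling overhead is paid once per worker, not once per element.
total = @distributed (+) for i in 1:10_000
    i^2  # the actual work per element is trivially cheap
end
```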
However, let's say you have very few tasks to do, each takes a while (it doesn't have to be long for the overhead to become negligible!), but they vary in just how long. Perhaps on top of that variation, the number of jobs doesn't divide evenly by the number of workers, e.g. 10 jobs and 8 workers.
If you hand these off to workers in advance, maybe the 4 slowest tasks end up with workers 1 and 2, while workers 5 and 8 receive relatively short tasks. So workers 5 and 8 finish soon, and then sit idle with nothing to do, while workers 1 and 2 chug along for a long time on their first tasks before finally finishing them and starting their second ones. What we would have wanted is for workers 5 and 8 to be handed the remaining tasks as soon as they were free.
That is what pmap does. Whenever a worker completes something, it'll be handed another assignment if any are left, thus balancing the workload across the available workers.
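A minimal sketch of that dynamic scheduling (the sleep durations are made up to stand in for tasks of varying length):

```julia
using Distributed
addprocs(4)

# Tasks with uneven durations; pmap hands each one out as soon as a
# worker frees up, so no worker sits idle while work remains.
durations = [0.4, 0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

results = pmap(durations) do d
    sleep(d)  # stand-in for real work of varying length
    myid()    # report which worker handled this task
end
```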

What if you want the best of both worlds for your particular task? E.g., maybe you want workers to receive batches of 400 at a time. That is what the batch_size argument is for:

```
help?> pmap
search: pmap promote_shape typemax PermutedDimsArray process_messages

  pmap(f, [::AbstractWorkerPool], c...; distributed=true, batch_size=1, on_error=nothing, retry_delays=[], retry_check=nothing) -> collection
```

In this way, you could tune pmap to have the behavior you want for a particular problem.
Of course, for something as cheap and consistent as squaring, I think just using @distributed (or batch_size = cld(length(iterable), nworkers())) is ideal.
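For instance (the worker count and range size are arbitrary here), handing out work in batches of 400, or recovering @distributed-style static chunking by computing one batch per worker:

```julia
using Distributed
addprocs(4)

# Workers receive 400 elements at a time instead of one.
squares = pmap(x -> x^2, 1:4_000; batch_size = 400)

# One batch per worker reproduces @distributed-style static chunking:
iterable = 1:4_000
squares2 = pmap(x -> x^2, iterable; batch_size = cld(length(iterable), nworkers()))
```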
