Using pmap: Do all workers have to be equally capable?

I am using pmap with workers spread across many computers. Some of those computers have more resources than others – does pmap know this and distribute tasks accordingly? E.g. if I have 10 workers on one machine vs. 2 workers on another (and the machines themselves are identical, say), does it take into account that the 10 workers sharing a single RAM cache will be moving more slowly than the machine with only 2 workers? Or is it up to me to “worker balance” and create workers with similar amounts of RAM per worker and such, to get the best performance?

Just trying to get a gauge for how much thinking I have to do when using pmap vs. how much pmap does for me.

pmap does implement load balancing: every worker does a unit of work and then comes back and asks “is there more that I can do?”. More capable workers will return faster and will thus, effectively, do more total work.

However, pmap doesn’t know how “capable” your workers are and doesn’t explicitly schedule things based on, say, total memory or CPU details. Note that even if that information were available, such scheduling would be almost impossible to implement generically: how is pmap supposed to know what the computational bottleneck of the provided function (to be run in parallel) is? It could be memory bandwidth. Or CPU performance. Or something else entirely. So there isn’t a generic notion of “capable”.
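A toy illustration of that self-scheduling behavior (the worker count and the artificial sleep simulating a slower worker are my own assumptions, not anything pmap configures for you):

```julia
using Distributed
addprocs(4)  # stand-in for a heterogeneous pool of workers

@everywhere function work(i)
    sleep(myid() == 2 ? 0.5 : 0.1)  # pretend worker 2 is much slower
    return myid()                   # report which worker handled this task
end

# Each worker pulls the next task as soon as it finishes its current one,
# so the slower worker ends up completing fewer tasks overall.
handled = pmap(work, 1:40)
counts = Dict(w => count(==(w), handled) for w in workers())
```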

You could also use Dagger.jl which works similarly to pmap, but is much smarter thanks to having a true scheduler. We don’t yet have support for measuring overall performance of each logical worker/processor, but we can definitely implement that. We already collect many metrics at runtime to tune the scheduler’s decisions, and that one could prove very beneficial for heterogeneous clusters. Feel free to file an issue and ping me if you’re interested in this!
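For a rough sense of what that looks like in practice (the workload function below is a made-up example; the pattern only assumes Dagger’s @spawn/fetch interface):

```julia
using Distributed
addprocs(4)
@everywhere using Dagger

@everywhere expensive(x) = sum(abs2, randn(500, 500) .* x)  # hypothetical workload

# Dagger's scheduler decides where each task runs, using metrics it
# collects at runtime rather than a fixed assignment of tasks to workers.
tasks = [Dagger.@spawn expensive(x) for x in 1:20]
results = fetch.(tasks)
```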

Interesting, thank you both! Great answers.
I think part of my difficulty is that each “unit of work” I distribute via pmap involves calling BLAS for linear programming, which does its own parallelization. As a result it’s a bit difficult to choose the number of workers per machine, because I don’t know how many threads BLAS will spawn. And I have found that sometimes using more workers actually slows things down if there is too much resource competition on a single machine: e.g. if a machine has 28 cores and I run 10 Julia workers, and each Julia worker tries to spawn another 28 BLAS threads, that is too much resource competition. I almost wish there were some parallelization library for Julia that would automatically choose how many Julia workers to spawn based upon how much of the machine is free at a given moment, and use fewer Julia workers if everything is tied up.

I would recommend doing BLAS.set_num_threads(1) on each worker to ensure predictable performance and better parallelization via pmap or Dagger. It can potentially limit parallelism if your work is too coarse-grained, but if you have enough small-ish BLAS calls, you should end up with less cross-worker compute and memory contention.
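A minimal sketch of that setup (the solve_one function and the problem sizes are placeholders for your own workload):

```julia
using Distributed
addprocs(4)
@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)  # one BLAS thread per worker; pmap supplies the parallelism

@everywhere solve_one(A, b) = A \ b  # placeholder for the real per-task work

problems = [(randn(500, 500), randn(500)) for _ in 1:32]  # dummy problem data
solutions = pmap(p -> solve_one(p[1], p[2]), problems)
```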

I almost wish there were some parallelization library for Julia that would automatically choose how many Julia workers to spawn based upon how much of the machine is free at a given moment, and use fewer Julia workers if everything is tied up.

This is something I plan to eventually add to Dagger. It’d require the user to specify how many workers should be spun up in total, when they should be added (1 minute at max load? 10 minutes at max load? something else?), and how to connect to them (if they require SSH or cluster commands).

As mentioned by @carstenbauer, pmap does not pre-assign tasks to the workers. Instead, a worker takes a task from the queue whenever it becomes idle. In the parallel-computing literature this is known as pure self-scheduling (PSS), which, in theory, results in the best load balance. The downside is that every time a worker takes a task, communication overhead is introduced, and this overhead grows linearly with the number of tasks. Still, the task-management overhead should be lower than that of a full-blown task-scheduling framework like Dagger.jl.
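If that per-task communication overhead matters (many cheap tasks), pmap’s batch_size keyword amortizes it by handing out tasks in groups; a small sketch with a made-up cheap task:

```julia
using Distributed
addprocs(4)
@everywhere cheap_task(x) = x^2  # stand-in for a very small unit of work

# Without batching, every task costs one round trip to a worker.
# batch_size groups tasks so each round trip carries many of them.
results = pmap(cheap_task, 1:10_000; batch_size=100)
```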

In your case, it seems like the work is already two-level. In that case, it’s ideal to let each worker use all of its threads for BLAS, and let pmap handle only the coarse tasks.
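A sketch of that two-level layout under assumed host names (one Julia worker per machine, with BLAS free to use all of that machine’s cores; the task is a placeholder):

```julia
using Distributed
# Hypothetical hosts: one worker per machine, so each coarse task gets a whole machine.
addprocs([("machine1", 1), ("machine2", 1)])
@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(Sys.CPU_THREADS)  # let BLAS parallelize within each machine

@everywhere coarse_task(A, b) = A \ b  # placeholder: one big BLAS-heavy unit of work

results = pmap(args -> coarse_task(args[1], args[2]),
               [(randn(2_000, 2_000), randn(2_000)) for _ in 1:8])
```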