I have a pretty simple problem: I have about 50 tasks I would like to run simultaneously, and I would like to exploit shared-memory parallelism within each task. I have control of about 50 nodes with about 40 CPUs each, so I would like to assign 40 CPUs (all of the CPUs on a particular node) to each of these 50 tasks.
I would like to do this with Julia’s native distributed/parallel capability. However, I haven’t been able to figure out how to do it with the current abstractions. If I simply wanted to distribute the 50 tasks so that a single core handled each task, I know how: I would use a pmap. If I only had 1 task that should use 40 shared-memory CPUs, I know how: I would use Julia’s shared-memory parallelism (multithreading). But how do I combine the two using Julia’s distributed/parallel abstractions? I would greatly appreciate your help!
Take a look at Multi-threaded worker processes - #6 by Pbellive and the other posts in this thread.
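The general pattern of multi-threaded worker processes can be sketched as follows. This is a minimal sketch, not necessarily the code from the linked post: two local workers with two threads each stand in for 50 nodes with 40 threads each, and `threaded_task` is a hypothetical placeholder for the real per-task work.

```julia
using Distributed

# Assumption: 2 local workers with 2 threads each, as a stand-in for
# one worker per node with 40 threads each on the real cluster.
addprocs(2; exeflags=`--threads=2`)

@everywhere begin
    # Hypothetical task: each call runs on one worker and uses that
    # worker's threads for its inner loop.
    function threaded_task(n)
        partials = zeros(Int, n)
        Threads.@threads for i in 1:n
            partials[i] = i^2   # each iteration writes its own slot, so no data race
        end
        return sum(partials)
    end
end

# Outer level: pmap hands one task at a time to each worker process.
results = pmap(threaded_task, 1:5)
```

On a real cluster the `addprocs` call would instead take a list of machines (one worker per node), with the same `exeflags` setting the thread count per worker.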
Thank you! I have a follow-up question: since pmap only assigns tasks to workers, wouldn’t I essentially be wasting an entire node with 40 CPUs on the master process when running a pmap in the outer loop? If so, is there any way to avoid this?
Well, you can start the workers wherever you want. The master process doesn’t have to be alone on that first node; you can launch a worker on the same node as the master.
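As a small illustration of that point, the sketch below starts local workers, which by construction live on the same machine as the master process, and checks where they run and how many threads they have. The worker and thread counts are placeholder assumptions for the example.

```julia
using Distributed

# Local workers are launched on the same machine as the master process.
# Assumption: 2 workers with 2 threads each, purely for illustration.
addprocs(2; exeflags=`--threads=2`)

master_host = gethostname()

# Ask every worker which host it runs on and how many threads it has.
worker_hosts = [remotecall_fetch(gethostname, w) for w in workers()]
worker_threads = [remotecall_fetch(Threads.nthreads, w) for w in workers()]

println("master on ", master_host, ", workers on ", worker_hosts)
```

With an SSH-launched cluster, the same idea would mean including the master’s own hostname in the machine list passed to `addprocs`, so that node also hosts a multi-threaded worker.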
Hmm I see, I’ll try to get that to work!
I have encountered one tricky complication when trying to mix shared-memory parallelism with distributed computing. I am running many iterations of a complicated function, and inside the function there are two steps:
- A distributed for loop over around 10000 array values
- A parallel map over an array of size 50 where each map performs a complicated task that I would like to parallelize
The first distributed for loop does not suffer from latency issues, and would work fine without using shared memory. However, it seems that switching the second step (the parallel map) from distributed parallelism to shared-memory parallelism means the first distributed for loop will no longer work as before, because it will “see” only 50 workers instead of around 2500.
I suppose one solution is to make the first distributed for loop also nest parallelism within an outer distributed loop. But there was no need for this originally, and it would make my code a lot more complex; ideally, the first distributed for loop will see 2500 workers, one for each CPU, while the second parallel map would see 50 workers, all with a lot of threads. Is there any way to do this cleanly? Or does making num threads > 1 force me to rewrite all the code, even the parts that worked fine when completely distributed?
If your problem can be expressed in a functional manner (no mutation of values shared across tasks), then I’d recommend trying Dagger.jl instead.
Dagger is designed to support multithreading and multiprocessing simultaneously. Instead of having to manually express your problem in terms of threads or workers, you simply express your work as pure functions, and Dagger will seamlessly use whatever CPU resources are available in your Distributed cluster to execute your work (set up via JULIA_NUM_THREADS, etc. as usual).
Dagger has an integrated scheduler which estimates processor utilization on remote workers, so it will try to schedule your work to minimize compute time, even if you’re executing functions with different signatures (and thus different expected run times).
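A minimal sketch of what this can look like, assuming Dagger.jl is installed in the active environment. `heavy` is a hypothetical pure function standing in for the real work, and the code uses whatever workers and threads the session already has (add workers with `addprocs` before the `@everywhere` lines if desired).

```julia
using Distributed

# Load Dagger on the master and on every worker that already exists.
@everywhere using Dagger

# Hypothetical pure function: no mutation of state shared across tasks.
@everywhere heavy(x) = sum(abs2, 1:x)

# Spawn one Dagger task per input; the scheduler decides which
# worker and thread each task runs on.
tasks = [Dagger.@spawn heavy(x) for x in 1:50]

# Collect the results in input order.
results = fetch.(tasks)
```

Note that nothing in the task definitions mentions workers or threads; the same code applies to both the 10000-element loop and the 50-element map described earlier in the thread, with granularity controlled by how much work each spawned function does.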