Shared memory computation inside of each task of parallel map

tictaccat · November 4, 2021, 3:47am

Hi!

I have a pretty simple problem: I have about 50 tasks I would like to do simultaneously, and I would like to do exploit shared memory parallelism on each task. I have control of about 50 nodes with about 40 CPUs each, so I would like to assign 40 CPUs (all of the CPUs on a particular node) to each of these 50 tasks.

I would like do this with Julia’s native distributed/parallel capability. However, I haven’t been able to figure out how to do with the current abstractions. If I wanted to simply distribute the 50 tasks where a single core did each task, I know how: I would use a pmap. If I only had 1 task which I would like to use 40 shared memory CPUs on, I know how: I would use Julia’s use shared memory parallelism. But how do I solve this problem using Julia’s distributed/parallel abstractions? I would greatly appreciate your help!

carstenbauer · November 4, 2021, 8:04am

Take a look at Multi-threaded worker processes - #6 by Pbellive and the other posts in this thread.

tictaccat · November 4, 2021, 2:34pm

Thank you! I have a follow-up question: since pmap only assigns to tasks to workers, wouldn’t I essentially be wasting an entire node with 40 CPUs when running a pmap in the outer loop? If so, is there any way to avoid this?

carstenbauer · November 4, 2021, 3:29pm

Well, you can start the workers wherever you want. The master process doesn’t have to be alone on that first node

tictaccat · November 4, 2021, 3:59pm

Hmm I see, I’ll try to get that to work!

I have encountered one tricky complication when trying to mix parallelism with distributed. I am running many iterations of a complicated function, and inside the function there are two steps:

A distributed for loop over around 10000 array values
A parallel map over an array of size 50 where each map performs a complicated task that I would like to parallelize

The first distributed for loop does not suffer from latency issues, and would work fine without using shared memory. However, it seems like making the switch to shared memory parallelism instead of distributed parallelism for the map function of the second parallel map means that the first distributed for loop will no longer work. Because the first distributed for loop will “see” only 50 workers, instead of around 2500.

I suppose one solution is to make the first distributed for loop also nest parallelism within an outer distributed loop. But there was no need for this originally, and it would make my code a lot more complex; ideally, the first distributed for loop will see 2500 workers, one for each CPU, while the second parallel map would see 50 workers, all with a lot of threads. Is there any way to do this cleanly? Or does making num threads > 1 force me to rewrite all the code, even the parts that worked fine when completely distributed?

jpsamaroo · November 4, 2021, 5:01pm

If your problem can be expressed in a functional manner (no mutation of values shared across tasks), then I’d recommend trying Dagger.jl instead.

Dagger is designed to support multithreading and multiprocessing simultaneously, and instead of having to manually express your problem in terms of threads or workers, you simply express your work as pure functions, and Dagger will seamlessly use whatever CPU resources are available in your Distributed cluster to execute your work (setup via addprocs, JULIA_NUM_THREADS, etc. as usual).

Dagger has an integrated scheduler which estimates processor utilization on remote workers, so it will try to schedule your work to minimize compute time, even if you’re executing functions with different signatures (and thus different expected run times).

Topic		Replies	Views
Is Pmap _both_ distributed and threaded? Performance multithreading , distributed , pmap	4	450	December 28, 2021
Attaching workers to cores General Usage distributed	10	484	July 22, 2020
Pmap but with control over which workers do the tasks? Julia at Scale question	0	411	January 12, 2020
Lack of improvement from distributed pmap, understanding a simple example New to Julia distributed , pmap	6	155	October 29, 2024
Running a function over multiple nodes on a julia cluster, from within a julia script Julia at Scale hpc , cluster , distributed , slurm	5	215	August 14, 2024

Shared memory computation inside of each task of parallel map

Related topics