How to correctly assign a device to each worker in a multi-GPU, multi-node scenario?

Hello folks,

Does anyone know how to correctly assign the GPUs in each node to the individual workers on that node? For instance, imagine I have 2 nodes, each with 2 GPUs. If I use the same procedure detailed in the CUDA.jl documentation, I imagine that the devices() function will return only the devices of the master node, while workers() will return all the workers from all nodes.

If I were using MPI.jl, I would group the workers by node, giving an intra-node id to each worker. One example of this is how the package ImplicitGlobalGrid.jl selects the correct device on each worker, using this intra-node id to pick the device.

How can I do something similar with Distributed? I want to have an intra-node id for each worker, so I can do this device selection too.

Thanks for the help.


Are you using a cluster with SLURM? If so, there is an option to bind a single GPU to each task (i.e. each process) to make sure each process gets its own GPU.

If not, you can always use device! from CUDA to set the GPU on each process based on myid from Distributed, but this is a bit more involved.

Oh thanks, I didn’t know about the --gpus-per-socket option for SLURM. I imagine that, at least for SLURM clusters, this is the solution.

Nevertheless, myid(), at least to my understanding, gives a global id for the worker, not an intra-node id. How could I use device! with it, @jmair?

I think --gpu-bind=single:1 is a better option, as you can have multiple processes on the same node if it has multiple GPUs, but 1 GPU per process, and you don’t need to do anything in Julia but use the GPU like normal.
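Something like this (untested) could be used to check that the binding worked as expected; with --gpu-bind=single:1 every worker should report exactly one visible device:

```julia
using Distributed, CUDA
@everywhere using CUDA

# With `--gpu-bind=single:1` each worker process should see exactly one GPU,
# so no manual device selection is needed; this just checks that assumption.
for p in workers()
    n = remotecall_fetch(() -> length(CUDA.devices()), p)
    println("worker $p sees $n GPU(s)")
end
```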

myid() is global, yes, but I assume (I don’t know for sure) that processes on the same node have adjacent ids, so you could use device_id = myid() % num_gpus + 1 and then device!(devices()[device_id]), for example. I haven’t tested this method and wouldn’t recommend it, but it could be a starting point to try something out. The SLURM option is much more ergonomic and I would recommend it.
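Spelled out, the idea would look roughly like this on each worker (untested sketch; num_gpus is just a placeholder for the number of GPUs per node):

```julia
using Distributed, CUDA

# Untested sketch of the idea above, to be run on each worker; `num_gpus` is
# the (assumed uniform) number of GPUs per node, and worker ids are assumed
# to be contiguous within a node.
num_gpus = 2
device_id = myid() % num_gpus + 1                  # 1-based round-robin index
CUDA.device!(collect(CUDA.devices())[device_id])   # collect so we can index
```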


Oh thank you for the clarification. The reason I am pursuing an option that doesn’t rely on SLURM is that I want to be able to write code that uses both the GPUs and the CPUs for compute. That way I would assign a GPU to each of the first workers on the node, and add a few more workers that would do some CPU work using Threads.

I will test the second approach you mentioned.

Thank you!

A shorter form of this is

CUDA.device!(myid() % length(CUDA.devices()))

Note that this should work in both cases: if you use device binding, then length(CUDA.devices()) == 1, and so you will always select device 0. If you don’t, then it will assign them in a round-robin manner (and oversubscribe them fairly evenly if you have more procs than devices per node). It only relies on the ids being sequential per node.
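For example (untested sketch), applied to all workers right after they have been added with addprocs:

```julia
using Distributed, CUDA
@everywhere using CUDA

# Minimal usage sketch: run the one-liner on every worker, then use CUDA as
# normal on each one afterwards.
for p in workers()
    remotecall_wait(p) do
        CUDA.device!(myid() % length(CUDA.devices()))
        @info "worker $(myid()) is using $(CUDA.device())"
    end
end
```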


Oh, this is cool! Thanks @simonbyrne