Distributed Julia, ClusterManagers package, Slurm, and CUDA on a shared file system


I’m running Julia 1.7.3 on a cluster using Slurm to launch batch jobs. The cluster has a shared filesystem. I’m using the Slurm cluster manager from the ClusterManagers package to call addprocs(...). The node I’m running the REPL on is a login node and does NOT have a GPU, but the procs returned from addprocs() DO have GPUs.
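For reference, the worker setup looks roughly like this; the partition name, walltime, and GPU flags below are placeholders for my actual settings, and the exact keyword names ClusterManagers forwards to Slurm may differ from yours:

```julia
using Distributed, ClusterManagers

# Launch one worker via Slurm; keyword arguments are forwarded as
# srun/sbatch flags (placeholders here, adjust to your cluster).
procs = addprocs(SlurmManager(1);
                 partition="a100",
                 time="02:00:00",
                 gpus_per_node=1)

# Shared filesystem, so the same project environment is visible everywhere.
@everywhere using CUDA
```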

When I run @fetchfrom procs[1] CUDA.ndevices() it returns 1. However, when I run @fetchfrom procs[1] CUDA.devices() it returns

Error showing value of type CUDA.DeviceIterator:
[1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89

Some other evidence that CUDA and the GPUs are installed correctly on the remote hosts: if I run

srun --time=02:00:00 --partition=a100 --nodes=1 --cpus-per-task=16 --gpus-per-node=1 --pty /bin/bash
julia --project=MyProjDirWithCUDAPkg
julia> using CUDA
julia> CUDA.devices()
CUDA.DeviceIterator() for 1 devices:
0. NVIDIA A100 80GB PCIe

I’ve done this experiment and double-checked that the host(s) returned from addprocs(...) and from the srun bash session are the same.

Any ideas why workers from Distributed (created via the Slurm ClusterManagers) can’t use the GPUs on hosts that clearly have them installed and working?

This is probably trying to show your local CUDA devices, and you don’t have any. Can you try a simpler, purely computational task, something like

@fetchfrom procs[1] sum(CUDA.rand(10))

so that you’re using the GPU but only transmitting “CPU-compatible” data over the network.
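Along the same lines, if you want to inspect the devices from the login node, fetch plain data instead of the iterator itself. A sketch, assuming CUDA.name is available as the device-name accessor:

```julia
# Materialize device info as plain Strings on the worker, then ship
# those back — Strings serialize fine, unlike live device handles.
@fetchfrom procs[1] [CUDA.name(d) for d in CUDA.devices()]
```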

That worked! It returned 4.443556f0. Sorry I didn’t think to try that in the first place.

So does this mean there’s a bug in CUDA.devices()? Any suggestions on how to document this better and report it?

P.S. I’m asking the system admin to install Julia 1.8.3 for me right now. Maybe the bug is already fixed?

I don’t know exactly, but I imagine the problem is that CUDA.devices() returns pointers/references to the actual devices on the system where it was called. When those are transmitted back to your login node, they’re invalid, since the devices are not present on the login node — thus the error.
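One way to test this diagnosis, assuming it’s right: render the value to a String on the worker, so that the show call (and the driver queries it makes) runs where the devices actually exist:

```julia
# sprint runs `show` on the worker, so any CUDA driver calls happen
# on the GPU node; only the resulting String crosses the network.
@fetchfrom procs[1] sprint(show, "text/plain", CUDA.devices())
```

If that prints the device list cleanly, the failure really is in displaying the fetched iterator locally, not in CUDA.devices() itself.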