Distributed Julia, ClusterManagers package, Slurm, and CUDA on a shared file system


I’m running Julia 1.7.3 on a cluster using Slurm to launch batch jobs. The cluster has a shared filesystem. I’m using the Slurm cluster manager from the ClusterManagers package to call addprocs(...). The node I’m running the REPL on is a login node and does NOT have a GPU, but the procs returned from addprocs() DO have GPUs.
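For reference, the worker setup looks roughly like this; the partition name, walltime, and GPU flags below are placeholders for my actual settings, and the exact keyword names ClusterManagers forwards to Slurm may differ from yours:

```julia
using Distributed, ClusterManagers

# Launch one worker via Slurm; keyword arguments are forwarded as
# srun/sbatch flags (placeholders here, adjust to your cluster).
procs = addprocs(SlurmManager(1);
                 partition="a100",
                 time="02:00:00",
                 gpus_per_node=1)

# Shared filesystem, so the same project environment is visible everywhere.
@everywhere using CUDA
```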

When I run @fetchfrom procs[1] CUDA.ndevices() it returns 1. However, when I run @fetchfrom procs[1] CUDA.devices() it returns

Error showing value of type CUDA.DeviceIterator:
[1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89

Some other evidence that CUDA and the GPUs are installed correctly on the remote hosts: if I run

srun --time=02:00:00 --partition=a100 --nodes=1 --cpus-per-task=16 --gpus-per-node=1 --pty /bin/bash
julia --project=MyProjDirWithCUDAPkg
julia> using CUDA
julia> CUDA.devices()
CUDA.DeviceIterator() for 1 devices:
0. NVIDIA A100 80GB PCIe

I’ve done this experiment and double-checked that the host(s) returned from addprocs(...) and from the srun bash session are the same.

Any ideas why workers from Distributed (created via the Slurm ClusterManagers) can’t use the GPUs on hosts that clearly have them installed and working?

This is probably trying to show your local CUDA devices, and you don’t have any. Can you try a simpler, purely computational task, something like

@fetchfrom procs[1] sum(CUDA.rand(10))

so that you’re using the GPU but only transmitting “CPU-compatible” data over the network.
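Along the same lines, if you want to inspect the devices from the login node, fetch plain data instead of the iterator itself. A sketch, assuming CUDA.name is available as the device-name accessor:

```julia
# Materialize device info as plain Strings on the worker, then ship
# those back — Strings serialize fine, unlike live device handles.
@fetchfrom procs[1] [CUDA.name(d) for d in CUDA.devices()]
```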

That worked! It returned 4.443556f0. Sorry I didn’t think to try that in the first place.

So does this mean there’s a bug in CUDA.devices()? Any suggestions on how to document this better and report it?

P.S. I’m asking the system admin to install Julia 1.8.3 for me right now. Maybe the bug is already fixed?

I don’t know exactly, but I imagine the problem is that CUDA.devices() returns pointers/references to the actual devices on the system where it was called. When those are transmitted back to your login node, they’re invalid, since the devices are not present on the login node — thus the error.
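One way to test this diagnosis, assuming it’s right: render the value to a String on the worker, so that the show call (and the driver queries it makes) runs where the devices actually exist:

```julia
# sprint runs `show` on the worker, so any CUDA driver calls happen
# on the GPU node; only the resulting String crosses the network.
@fetchfrom procs[1] sprint(show, "text/plain", CUDA.devices())
```

If that prints the device list cleanly, the failure really is in displaying the fetched iterator locally, not in CUDA.devices() itself.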