Hello,
I’m running Julia 1.7.3 on a cluster, using Slurm to launch batch jobs. The cluster has a shared filesystem. I’m using the Slurm cluster manager from the ClusterManagers package to call `addprocs(...)`. The node I’m running the REPL on is a login node and does NOT have a GPU, but the workers returned from `addprocs(...)` DO have GPUs.
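For reference, the workers are added roughly like this (a minimal sketch; the specific srun options such as `partition`, `gpus_per_node`, and `time` are illustrative, since ClusterManagers forwards extra keyword arguments to srun with underscores converted to dashes):

```julia
using Distributed, ClusterManagers

# SlurmManager forwards extra keyword arguments to srun, converting
# underscores to dashes (e.g. gpus_per_node => --gpus-per-node).
# The option values below are illustrative, not my exact call.
procs = addprocs(SlurmManager(2);
                 partition="a100",
                 gpus_per_node="1",
                 time="02:00:00")

@everywhere using CUDA
```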
When I run `@fetchfrom procs[1] CUDA.ndevices()`, it returns `1`. However, when I run `@fetchfrom procs[1] CUDA.devices()`, I get:
```
Error showing value of type CUDA.DeviceIterator:
ERROR: CUDA error (code 100, CUDA_ERROR_NO_DEVICE)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
```
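In case it helps narrow things down, here is a diagnostic sketch for inspecting the GPU-related environment a worker actually sees (it assumes `procs` from the `addprocs` call above; `CUDA_VISIBLE_DEVICES` and `SLURM_JOB_GPUS` are variables Slurm typically sets when a GPU is allocated to a step):

```julia
using Distributed

# Check what the worker process sees. If Slurm did not allocate a GPU
# to the srun step that launched the worker, these will be unset.
@fetchfrom procs[1] begin
    (host         = gethostname(),
     cuda_visible = get(ENV, "CUDA_VISIBLE_DEVICES", "<unset>"),
     slurm_gpus   = get(ENV, "SLURM_JOB_GPUS", "<unset>"))
end
```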
As further evidence that CUDA and the GPUs are installed correctly on the remote hosts: if I run
```
srun --time=02:00:00 --partition=a100 --nodes=1 --cpus-per-task=16 --gpus-per-node=1 --pty /bin/bash
julia --project=MyProjDirWithCUDAPkg

julia> using CUDA

julia> CUDA.devices()
CUDA.DeviceIterator() for 1 devices:
0. NVIDIA A100 80GB PCIe
```
I’ve done this experiment and double-checked that the host(s) returned from `addprocs(...)` and from the `srun ... --pty /bin/bash` session are the same.
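One more check that might be useful, run from the login-node REPL (a sketch; it assumes `nvidia-smi` is on the workers’ PATH):

```julia
using Distributed

# Run nvidia-smi inside the worker process to see whether the driver-level
# view of the node lists the GPU, independent of CUDA.jl's device query.
println(@fetchfrom procs[1] read(`nvidia-smi -L`, String))
```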
Any ideas why workers from Distributed (created via the ClusterManagers Slurm manager) can’t use the GPUs on nodes that clearly have them installed and working?