What is the correct way to use multiple GPUs in a Slurm cluster?

Hello Julia community,

I am trying to distribute my computation across multiple GPUs in a Slurm cluster. The code runs correctly when I use a single GPU board under Slurm. However, when I try to use more than one GPU, I get a segmentation fault from CUDA.

I have tried to figure out the issue by searching online, but there are not many resources on this topic, so I am posting a question here to ask for help.

To simplify my code, let me write the complicated problem I need to solve as:

val = solve_prob(p, q)

with input parameters p and q and a return value val::ComplexF64. The function solve_prob uses CUDA without scalar indexing.
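
For the sake of discussion, a minimal stand-in with the same signature (the real function is far more involved, so this is only a hypothetical sketch) would be something like:

using CUDA

# Hypothetical stand-in for solve_prob, just to fix the signature:
# allocate CuArrays locally, do some broadcasted GPU work, and
# reduce to a single ComplexF64 without scalar indexing.
function solve_prob(p::ComplexF64, q::ComplexF64)
    A = CUDA.fill(p, 256)
    B = CUDA.fill(q, 256)
    return sum(A .* B)   # returns a ComplexF64
end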

Because I have many different cases of p and q that I need to run through, I want to distribute the work across multiple GPUs. My idea is to create (for example) 4 Slurm tasks and use one GPU board per task. The following are my Slurm and Julia scripts:

script.sh:

#!/bin/bash
### job name and time settings ###
#SBATCH --job-name=test
#SBATCH --time=1:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#
### Node and task settings ###
#SBATCH --nodes=1
#SBATCH --ntasks=4
#
### CPU settings ###
#SBATCH --cpus-per-task=1
#
### GPU settings ###
#SBATCH --gpus-per-task=1   # each task gets 1 GPU
#SBATCH --gpu-bind=single:1 # bind 1 GPU per task
####################

export JULIA_NUM_THREADS=${SLURM_CPUS_PER_TASK}

/path/to/julia script.jl

Note that I use only 1 node and bind 1 GPU per task.
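
If it helps, the binding could be double-checked by having each task print what it actually sees, e.g. with something like the following snippet launched per task via srun (just a sanity check, not part of the real job):

# Per-task sanity check (hypothetical): print which GPU each Slurm task was bound to.
using CUDA
println("task ", get(ENV, "SLURM_PROCID", "?"),
        ": CUDA_VISIBLE_DEVICES = ", get(ENV, "CUDA_VISIBLE_DEVICES", "unset"),
        ", visible devices = ", length(collect(CUDA.devices())))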

script.jl:

using Distributed, SharedArrays, SlurmClusterManager

#############################
# Add workers (one per Slurm task)
addprocs(SlurmManager())

#############################
# Load packages for master and workers
@everywhere begin
    # load the other packages I need here (everything except CUDA)
end

#############################
# Load CUDA only on the workers (myid() != 1)
@everywhere if myid() != 1
    using CUDA
    CUDA.device!(0)  # each worker sees only its assigned GPU
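    # (because of --gpu-bind=single:1, each task sees exactly one GPU, so device index 0 is the bound one)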
    CUDA.allowscalar(false)
end

#############################
# Task and Model setup for workers
@everywhere begin
    # skip defining solve_prob on the master, since CUDA is not loaded there
    if myid() != 1
        function solve_prob(p, q)
            # complicated ...
        end
    end
    
    Np = 1000 # number of p values
    plist = rand(ComplexF64, Np)

    Nq = 800 # number of q values
    qlist = rand(ComplexF64, Nq)
end

#############################
# Distributed loop to workers
results = SharedArray{ComplexF64}(Np, Nq)
@sync @distributed for j in eachindex(qlist)
    for i in eachindex(plist)
        val = solve_prob(plist[i], qlist[j])
        results[i, j] = val

        println("$i, $j")
        flush(stdout)
    end
end

# save results ...

rmprocs(workers()) # remove workers

Note that I only distribute over qlist; the inner loop over plist runs locally on each worker. This way, I believe the writes results[i, j] = val cannot race with each other.
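
To illustrate what I mean, here is a minimal CPU-only sketch of the same pattern (dummy_solve is a hypothetical stand-in for solve_prob): @distributed hands each worker a contiguous block of j values, so every column of the shared array is written by exactly one worker.

using Distributed, SharedArrays
addprocs(2)
@everywhere using SharedArrays

# hypothetical stand-in for solve_prob (CPU only, just for the sketch)
@everywhere dummy_solve(p, q) = p * q

ps = rand(ComplexF64, 3)
qs = rand(ComplexF64, 8)
res = SharedArray{ComplexF64}(length(ps), length(qs))

# @distributed assigns each worker a contiguous block of j values,
# so no two workers ever write to the same column res[:, j]
@sync @distributed for j in eachindex(qs)
    println("worker $(myid()) handles column j = $j")
    for i in eachindex(ps)
        res[i, j] = dummy_solve(ps[i], qs[j])
    end
end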

After submitting this job to Slurm, I noticed that it runs for a couple of iterations at the beginning, and then, at some point, some of the tasks throw a segmentation fault, for example:

srun: error: hostname: task 2: Segmentation fault (core dumped)
srun: error: hostname: task 0: Segmentation fault (core dumped)

The segmentation fault happens at a different iteration step each time I re-submit the job.

I also considered the possibility that the segmentation fault originates from my solve_prob function. However, this function only allocates CuArrays locally on each worker and never touches another worker's variables, so I don't think it is the cause of the segmentation fault.
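
(For completeness, the single-GPU case that runs without problems is essentially just the serial double loop over the same parameter lists, sketched here with solve_prob as above:)

using CUDA
CUDA.allowscalar(false)

Np, Nq = 1000, 800
plist = rand(ComplexF64, Np)
qlist = rand(ComplexF64, Nq)

# plain serial double loop on a single GPU
results = Matrix{ComplexF64}(undef, Np, Nq)
for j in eachindex(qlist), i in eachindex(plist)
    results[i, j] = solve_prob(plist[i], qlist[j])
end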