Hello Julia community,
I am trying to distribute my computation across multiple GPUs on a Slurm cluster. The code runs correctly when I use a single GPU board under Slurm. However, when I try to use more than one GPU, I get a segmentation fault from CUDA.
I tried to figure out the issue by searching online, but there are not many resources about this topic, so I am posting a question here to look for help.
To simplify my code, let me refer to the complicated problem that I need to solve as:
val = solve_prob(p, q)
with input parameters p and q, and a return value val::ComplexF64. The function solve_prob uses CUDA without scalar indexing.
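I cannot post the real function here, but a rough sketch of its structure (the body below is only a made-up placeholder to illustrate the interface; the actual computation is much more involved) looks something like this:

using CUDA

# Placeholder sketch of solve_prob: the real function is more complicated,
# but it only allocates CuArrays locally and returns a scalar ComplexF64.
function solve_prob(p::ComplexF64, q::ComplexF64)
    N = 256                       # illustrative problem size
    A = CUDA.rand(Float64, N, N)  # temporary device arrays, local to this call
    b = CUDA.rand(Float64, N)
    x = (A * b) .* p .+ q         # some GPU work, no scalar indexing
    return ComplexF64(sum(x))     # only a scalar is copied back to the host
end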
Because I have many different values of p and q that I need to run through, I want to distribute the work over multiple GPUs. My idea is to create (for example) 4 Slurm tasks and use one GPU board per task. The following are my Slurm and Julia scripts:
script.sh:
#!/bin/bash
### job name and time settings ###
#SBATCH --job-name=test
#SBATCH --time=1:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#
### Node and task settings ###
#SBATCH --nodes=1
#SBATCH --ntasks=4
#
### CPU settings ###
#SBATCH --cpus-per-task=1
#
### GPU settings ###
#SBATCH --gpus-per-task=1 # each task gets 1 GPU
#SBATCH --gpu-bind=single:1 # bind 1 GPU per task
####################
export JULIA_NUM_THREADS=${SLURM_CPUS_PER_TASK}
/path/to/julia script.jl
Note that I only use 1 node, and bind 1 GPU per task.
script.jl:
using Distributed, SharedArrays, SlurmClusterManager
#############################
# Add workers (matches Slurm tasks)
addprocs(SlurmManager())
#############################
# Load packages for master and workers
@everywhere begin
# load the packages I need here (everything except CUDA)
end
#############################
# Load CUDA only for workers [myid() != 1]
@everywhere if myid() != 1
using CUDA
CUDA.device!(0) # each worker sees only its assigned GPU
CUDA.allowscalar(false)
end
#############################
# Task and Model setup for workers
@everywhere begin
# skip defining solve_prob for master since CUDA is not loaded
if myid() != 1
function solve_prob(p, q)
# complicated ...
end
end
Np = 1000 # number of parameter p
plist = rand(ComplexF64, Np)
Nq = 800 # number of parameter q
qlist = rand(ComplexF64, Nq)
end
#############################
# Distributed loop to workers
results = SharedArray{ComplexF64}(Np, Nq)
@sync @distributed for j in eachindex(qlist)
for i in eachindex(plist)
val = solve_prob(plist[i], qlist[j])
results[i, j] = val
println("$i, $j")
flush(stdout)
end
end
# save results ...
rmprocs(workers()) # remove workers
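As a sanity check, I was thinking of adding something like the following right after the CUDA setup block in script.jl, just to confirm that each worker only sees its assigned GPU (this is only a sketch, not part of the script above; Slurm sets CUDA_VISIBLE_DEVICES per task):

# Sketch of a per-worker check (assumes the worker/CUDA setup from script.jl)
@everywhere if myid() != 1
    println("worker ", myid(), " on ", gethostname(),
            ": CUDA_VISIBLE_DEVICES = ", get(ENV, "CUDA_VISIBLE_DEVICES", "unset"),
            ", devices = ", collect(CUDA.devices()))
    flush(stdout)
end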
Note that I only distribute over qlist; the inner for-loop over plist runs locally in each worker. In this way, I think the assignment results[i, j] = val will not have a race condition.
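To convince myself of that, the small check below (hypothetical, not part of the real script; owners is a name I made up) records which worker handles each column. Since @distributed assigns every j to exactly one worker, no two workers should ever write to the same column of results:

# Hypothetical check: record the owner of each column j of `results`
owners = SharedArray{Int}(Nq)
@sync @distributed for j in 1:Nq
    owners[j] = myid()
end
println(owners)  # each entry is the id of the single worker that owns that column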
After I submit this job to Slurm, I noticed that it runs for a couple of iterations at the beginning, and then, at some point, some of the tasks throw a segmentation fault, for example:
srun: error: hostname: task 2: Segmentation fault (core dumped)
srun: error: hostname: task 0: Segmentation fault (core dumped)
The segmentation fault happens at a different iteration step if I re-submit the job.
I was also wondering whether this segmentation fault originates from my solve_prob function. But this function only allocates CuArrays locally in each worker and does not access any other worker's variables, so I don't think it should be the cause of the segmentation fault.
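In case it is related to GPU memory after all, one idea I have (not verified) is to print the GPU memory status on each worker from time to time, e.g. between iterations, to see whether device memory keeps growing:

# Idea for a diagnostic (assumption: checking CUDA.memory_status() per worker
# between iterations would show whether device memory keeps growing)
@everywhere if myid() != 1
    CUDA.memory_status()  # prints used/free device memory for this worker's GPU
    flush(stdout)
end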