SlurmClusterManager timeout

Hello,
I have been trying to run some jobs on a Slurm cluster (specifically, the Canadian Narval cluster of the Digital Research Alliance of Canada).

Possibly due to my inexperience with it, I have encountered some issues (I am not opening a bug report because I might just be using something incorrectly).

Specifically: after the resources are allocated, the job is unable to add the Slurm worker processes.

Here is the sbatch file I use:

#!/bin/bash
#SBATCH --account=rrg-wperciva
#SBATCH --job-name=JuliaTuringMCMC_Adj
#SBATCH --output=julia_turing_%j.out
#SBATCH --error=julia_turing_%j.err
#SBATCH --time=00:10:00

#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2G

module load julia/1.11.3
export JULIA_NUM_THREADS=$M_THREADS
export JULIA_PROJECT=/home/mbonici/turing_benchmarks

echo "--- Testing srun for multiple tasks ---"
srun hostname
echo "--- End srun test ---"

srun julia /home/mbonici/turing_benchmarks/your_turing_script.jl

Here is the script I run:

using Distributed: addprocs, workers, nworkers, remotecall_fetch, @everywhere
using SlurmClusterManager: SlurmManager
using Turing
# Add these lines to check environment variables
println("--- Debugging Environment Variables ---")
println("SLURM_JOBID: ", get(ENV, "SLURM_JOBID", "Not set"))
println("SLURM_NTASKS: ", get(ENV, "SLURM_NTASKS", "Not set"))
println("SLURM_CPUS_PER_TASK: ", get(ENV, "SLURM_CPUS_PER_TASK", "Not set"))
println("JULIA_PROJECT: ", get(ENV, "JULIA_PROJECT", "Not set"))
println("PATH: ", get(ENV, "PATH", "Not set"))
println("--- End Debugging Environment Variables ---\n")

const SLURM_CPUS_PER_TASK = parse(Int, ENV["SLURM_CPUS_PER_TASK"])
const JULIA_PROJECT_PATH = ENV["JULIA_PROJECT"]

println("SLURM allocated processes, each with $(SLURM_CPUS_PER_TASK) CPU cores.")
println("Julia project path: $JULIA_PROJECT_PATH")

exeflags = ["--project=$JULIA_PROJECT_PATH", "--threads=$SLURM_CPUS_PER_TASK"]

println("Attempting to add Julia worker processes using SlurmClusterManager...")
try
    # Extend the launch timeout and forward the project/threads flags defined
    # above to each worker.
    addprocs(SlurmManager(launch_timeout=600.0); exeflags=exeflags)
    println("Successfully added $(nworkers()) worker processes.")
catch e
    println(stderr, "Error adding workers: ", e)
    Base.showerror(stderr, e, catch_backtrace())
    exit(1) # Exit the job if worker addition fails
end
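
For reference, here is a minimal sketch of the kind of sanity check I would run once the workers are up (nothing from my actual workload, just the remotecall_fetch imported above asking each worker for its hostname):

for w in workers()
    # ask each Slurm worker which node it landed on
    host = remotecall_fetch(gethostname, w)
    println("worker $w is running on $host")
end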

In the end, it does not work: adding the Slurm worker processes never completes and eventually times out.

Here is the output I get:

--- Testing srun for multiple tasks ---
nc31146
nc31129
--- End srun test ---
--- Debugging Environment Variables ---
SLURM_JOBID: 45083109
SLURM_NTASKS: 2
SLURM_CPUS_PER_TASK: 2
JULIA_PROJECT: /home/mbonici/turing_benchmarks
PATH: /cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/julia/1.11.3/bin:/home/mbonici/.local/bin:/home/mbonici/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/mii/1.1.2/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/flexiblascore/3.3.1/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcc12/openmpi/4.1.5/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/ucc/1.2.0/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/pmix/4.2.4/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/libfabric/1.18.0/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/ucx/1.14.1/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/hwloc/2.9.1/sbin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/hwloc/2.9.1/bin:/cvmfs/soft.computecanada.ca/gentoo/2023/x86-64-v3/usr/x86_64-pc-linux-gnu/gcc-bin/12:/cvmfs/soft.computecanada.ca/easybuild/bin:/cvmfs/soft.computecanada.ca/custom/bin:/cvmfs/soft.computecanada.ca/gentoo/2023/x86-64-v3/usr/bin:/cvmfs/soft.computecanada.ca/custom/bin/computecanada:/opt/software/slurm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin
--- End Debugging Environment Variables ---

SLURM allocated processes, each with 2 CPU cores.
Julia project path: /home/mbonici/turing_benchmarks
Attempting to add Julia worker processes using SlurmClusterManager...
--- Debugging Environment Variables ---
SLURM_JOBID: 45083109
SLURM_NTASKS: 2
SLURM_CPUS_PER_TASK: 2
JULIA_PROJECT: /home/mbonici/turing_benchmarks
PATH: /cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/julia/1.11.3/bin:/home/mbonici/.local/bin:/home/mbonici/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/mii/1.1.2/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/flexiblascore/3.3.1/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcc12/openmpi/4.1.5/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/ucc/1.2.0/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/pmix/4.2.4/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/libfabric/1.18.0/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/ucx/1.14.1/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/hwloc/2.9.1/sbin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/hwloc/2.9.1/bin:/cvmfs/soft.computecanada.ca/gentoo/2023/x86-64-v3/usr/x86_64-pc-linux-gnu/gcc-bin/12:/cvmfs/soft.computecanada.ca/easybuild/bin:/cvmfs/soft.computecanada.ca/custom/bin:/cvmfs/soft.computecanada.ca/gentoo/2023/x86-64-v3/usr/bin:/cvmfs/soft.computecanada.ca/custom/bin/computecanada:/opt/software/slurm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin
--- End Debugging Environment Variables ---

SLURM allocated processes, each with 2 CPU cores.
Julia project path: /home/mbonici/turing_benchmarks
Attempting to add Julia worker processes using SlurmClusterManager...

I also tried extending the launch timeout, as another user suggested, but it did not help.
Any suggestions? Is there anything I am doing wrong?

Edit: cc @dilumaluthge and @jkleinh.

Usually, the way I debug this is to fire up an interactive node, run the payload manually, and see where it gets stuck / whether it can connect back to the login node.
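
Roughly, the session looks like this (just a sketch, assuming a Julia REPL started inside the allocation; the plain local addprocs is only there to rule out basic Distributed problems before trying the Slurm launch):

using Distributed, SlurmClusterManager

# sanity check: can plain local workers start at all?
addprocs(1)
println("local worker ok: ", nworkers())
rmprocs(workers())

# now the Slurm launch, with a generous timeout, to see where it stalls
addprocs(SlurmManager(launch_timeout=600.0))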

One failure mode I have encountered is JuliaLang/Distributed.jl issue #85 on GitHub, "Distributed worker manager doesn't use socket connection to infer worker ip".
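
A quick way to check whether you are hitting that (again just a sketch, meant to be run manually on one of the compute nodes) is to compare the address Julia would advertise by default with the full set of local addresses, and then see whether the advertised one is reachable from the node where the manager runs:

using Sockets

# the address a worker would typically advertise to the manager
println("default address: ", getipaddr())
# every address configured on this node; if the default is on an interface the
# manager node cannot reach, workers hang exactly as described in issue #85
println("all local addresses: ", getipaddrs())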


I’ll try that.
(I think this thread will help clear up my ignorance on this topic.)