Hello,
I have been trying to run some jobs on a Slurm cluster (specifically, the Narval cluster of the Digital Research Alliance of Canada).
Possibly due to my inexperience with it, I have run into some issues (I am not opening a bug report because I might simply be using something the wrong way).
Specifically: after resource allocation, the job fails to add the Slurm worker processes.
Here is the sbatch file I use:
#!/bin/bash
#SBATCH --account=rrg-wperciva
#SBATCH --job-name=JuliaTuringMCMC_Adj
#SBATCH --output=julia_turing_%j.out
#SBATCH --error=julia_turing_%j.err
#SBATCH --time=00:10:00
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2G
module load julia/1.11.3
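# NOTE: M_THREADS is not defined in this file; it is assumed to be set in the environment at submission time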
export JULIA_NUM_THREADS=$M_THREADS
export JULIA_PROJECT=/home/mbonici/turing_benchmarks
echo "--- Testing srun for multiple tasks ---"
srun hostname
echo "--- End srun test ---"
srun julia /home/mbonici/turing_benchmarks/your_turing_script.jl
Here is the script I run:
using Distributed: addprocs, workers, nworkers, remotecall_fetch, @everywhere
using SlurmClusterManager: SlurmManager
using Turing
# Add these lines to check environment variables
println("--- Debugging Environment Variables ---")
println("SLURM_JOBID: ", get(ENV, "SLURM_JOBID", "Not set"))
println("SLURM_NTASKS: ", get(ENV, "SLURM_NTASKS", "Not set"))
println("SLURM_CPUS_PER_TASK: ", get(ENV, "SLURM_CPUS_PER_TASK", "Not set"))
println("JULIA_PROJECT: ", get(ENV, "JULIA_PROJECT", "Not set"))
println("PATH: ", get(ENV, "PATH", "Not set"))
println("--- End Debugging Environment Variables ---\n")
const SLURM_CPUS_PER_TASK = parse(Int, ENV["SLURM_CPUS_PER_TASK"])
const JULIA_PROJECT_PATH = ENV["JULIA_PROJECT"]
println("SLURM allocated processes, each with $(SLURM_CPUS_PER_TASK) CPU cores.")
println("Julia project path: $JULIA_PROJECT_PATH")
exeflags = ["--project=$JULIA_PROJECT_PATH", "-t $SLURM_CPUS_PER_TASK"]
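# NOTE: exeflags is built here, but in this version it is not passed to the addprocs call below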
println("Attempting to add Julia worker processes using SlurmClusterManager...")
try
    # You can add a timeout parameter if SlurmClusterManager.jl supports it
    # E.g., addprocs(SlurmManager(); exeflags=exeflags, timeout=300) for 5 minutes
    addprocs(SlurmManager(launch_timeout=600.0))
    println("Successfully added $(nworkers()) worker processes.")
catch e
    println(stderr, "Error adding workers: ", e)
    Base.showerror(stderr, e, catch_backtrace())
    exit(1)  # Exit the job if worker addition fails
end
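For context, the script imports remotecall_fetch, @everywhere, and Turing because, once the workers are added, I intend to do roughly the following (a minimal sketch of what I have in mind; the actual sampling code is omitted, and this part is never reached in the failing runs):
using Distributed: @everywhere, remotecall_fetch, workers
# Quick sanity check: ask every worker for the hostname of the node it runs on.
for w in workers()
    println("worker $w is running on ", remotecall_fetch(gethostname, w))
end
# Load Turing on all workers before doing any distributed sampling.
@everywhere using Turing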
In the end, it does not work: adding the Slurm worker processes takes too long and never completes (the output below ends at the "Attempting to add..." line).
Here is the output I get (note that everything after the srun test appears twice; I assume this is because srun launches the Julia script once per task):
--- Testing srun for multiple tasks ---
nc31146
nc31129
--- End srun test ---
--- Debugging Environment Variables ---
SLURM_JOBID: 45083109
SLURM_NTASKS: 2
SLURM_CPUS_PER_TASK: 2
JULIA_PROJECT: /home/mbonici/turing_benchmarks
PATH: /cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/julia/1.11.3/bin:/home/mbonici/.local/bin:/home/mbonici/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/mii/1.1.2/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/flexiblascore/3.3.1/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcc12/openmpi/4.1.5/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/ucc/1.2.0/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/pmix/4.2.4/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/libfabric/1.18.0/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/ucx/1.14.1/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/hwloc/2.9.1/sbin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/hwloc/2.9.1/bin:/cvmfs/soft.computecanada.ca/gentoo/2023/x86-64-v3/usr/x86_64-pc-linux-gnu/gcc-bin/12:/cvmfs/soft.computecanada.ca/easybuild/bin:/cvmfs/soft.computecanada.ca/custom/bin:/cvmfs/soft.computecanada.ca/gentoo/2023/x86-64-v3/usr/bin:/cvmfs/soft.computecanada.ca/custom/bin/computecanada:/opt/software/slurm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin
--- End Debugging Environment Variables ---
SLURM allocated processes, each with 2 CPU cores.
Julia project path: /home/mbonici/turing_benchmarks
Attempting to add Julia worker processes using SlurmClusterManager...
--- Debugging Environment Variables ---
SLURM_JOBID: 45083109
SLURM_NTASKS: 2
SLURM_CPUS_PER_TASK: 2
JULIA_PROJECT: /home/mbonici/turing_benchmarks
PATH: /cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/julia/1.11.3/bin:/home/mbonici/.local/bin:/home/mbonici/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/mii/1.1.2/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/flexiblascore/3.3.1/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcc12/openmpi/4.1.5/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/ucc/1.2.0/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/pmix/4.2.4/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/libfabric/1.18.0/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/ucx/1.14.1/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/hwloc/2.9.1/sbin:/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/hwloc/2.9.1/bin:/cvmfs/soft.computecanada.ca/gentoo/2023/x86-64-v3/usr/x86_64-pc-linux-gnu/gcc-bin/12:/cvmfs/soft.computecanada.ca/easybuild/bin:/cvmfs/soft.computecanada.ca/custom/bin:/cvmfs/soft.computecanada.ca/gentoo/2023/x86-64-v3/usr/bin:/cvmfs/soft.computecanada.ca/custom/bin/computecanada:/opt/software/slurm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin
--- End Debugging Environment Variables ---
SLURM allocated processes, each with 2 CPU cores.
Julia project path: /home/mbonici/turing_benchmarks
Attempting to add Julia worker processes using SlurmClusterManager...
Following a suggestion that another user found useful, I also tried extending the waiting time (the launch_timeout=600.0 in the addprocs call above), but it did not help.
Any suggestions? Is there anything I am doing wrong?
Edit: cc @dilumaluthge and @jkleinh.