Addprocs_slurm not connecting to all available workers

I’m using Distributed and ClusterManagers to run large numbers of simulations in parallel on my university HPC cluster. Scheduling and job management on the cluster are handled by Slurm.

I initialise the parallel worker processes with the following:

using Distributed, ClusterManagers

available_workers = parse(Int, ENV["SLURM_NTASKS"])

addprocs_slurm(available_workers; topology = :master_worker)

However, my job times out in the addprocs_slurm step. Specifically, I schedule the job using sbatch with the following .sh file:

#!/bin/bash
# set the number of nodes.
#SBATCH --nodes=32
# set the number of tasks per node.
#SBATCH --ntasks-per-node=48
# set the amount of memory needed for each CPU.
#SBATCH --mem-per-cpu=8000
# set max wallclock time (hh:mm:ss).
#SBATCH --time=5:00:00
# set the time partition for the job. 
#SBATCH --partition=short
# set name of job (AND DATE!)
#SBATCH --job-name=my_job_name
# mail alert at start, end, and failure of execution
#SBATCH --mail-type=ALL
# send mail to this address
#SBATCH --mail-user=my_email_here
# run the application

module load Julia/1.8.2-linux-x86_64

julia requirements.jl
julia my_job.jl > my_job.log

requirements.jl is a file containing all the dependencies for my simulations, and looks like:

using Pkg 

dependencies = [# list of packages used here]

Pkg.add(dependencies)

my_job.jl contains the code shown at the top of my post, followed by the code to execute my simulations.

If I open my_job.log it looks like:

connecting to worker 1 out of 1536
connecting to worker 2 out of 1536
# 884 lines omitted
connecting to worker 887 out of 1536
# No more lines after here

So it looks like addprocs_slurm is getting stuck somewhere, and my job is timing out in that step while waiting to connect to all the workers.

I’m kind of at a loss as to where to begin debugging this. Does anyone have any suggestions?

*The Julia version is 1.8.2. This specific job is part of corrections for a paper I’m working on, whose previous simulations were also run in 1.8.2, so I’d rather not upgrade the version if I can avoid it.

Looks like this is an open issue with ClusterManagers.jl: Slurm broken · Issue #196 · JuliaParallel/ClusterManagers.jl · GitHub

Why are you using sbatch? What happens if you use SlurmManager directly inside julia? Something like

my_job.jl:

include("requirements.jl")

function my_long_running_function()
    # a single simulation goes here
end

and create a run.jl file with

using ClusterManagers, Distributed

# n_procs = number of workers to request, n_sims = number of simulations;
# both are placeholders to set for your job.
addprocs(SlurmManager(n_procs); topology = :master_worker)

@everywhere include("my_job.jl")

function run()
    results = pmap(1:n_sims) do x
        my_long_running_function()
    end
    return results
end
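You would then launch run.jl directly with julia rather than through sbatch, letting SlurmManager request the worker allocation itself via srun; whether you can do that from a login node or need an interactive/salloc session depends on your cluster’s policy.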

Why are you using sbatch

My cluster prefers the use of salloc and sbatch for requesting resources. Basically all of the compute nodes on the cluster are non-interactive, so I thought sbatch was more apt.

What happens if you use SlurmManager directly inside julia?

I haven’t tried this, but I’ll give it a go to see if it reproduces the error. To be honest, I’m struggling to reproduce the error at all (or at least to get an MWE). It has happened twice when I run large numbers of tasks, but I can’t trigger it with the smaller number of CPUs available on the debugging/interactive nodes.

If you use SlurmManager within Julia’s addprocs, it will internally construct and run an srun command, which is really the same as sbatch. Try it and let us know the results.
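As a rough sketch (if I remember the ClusterManagers.jl interface correctly, extra keyword arguments to addprocs are forwarded to srun as options; the worker count, partition name, and time limit below are placeholders):

using Distributed, ClusterManagers

# Request 48 workers via srun; partition and t are forwarded to srun
# as command-line options (placeholder values, adjust to your cluster).
addprocs(SlurmManager(48); partition = "short", t = "00:30:00", topology = :master_worker)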

Alternatively, for your workflow there is GitHub - kleinhenz/SlurmClusterManager.jl: julia package for running code on slurm clusters, which connects workers to already-allocated resources instead of allocating new ones.
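Usage is roughly the following (a sketch based on its README; the script has to run inside an existing allocation, e.g. one created by your sbatch script, and the worker count is read from the Slurm environment rather than passed explicitly):

using Distributed, SlurmClusterManager

# SlurmManager() picks up the number of tasks from the Slurm allocation
# (e.g. SLURM_NTASKS), so no explicit worker count is needed.
addprocs(SlurmManager(); topology = :master_worker)

println("connected to $(nworkers()) workers")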


Thanks for the suggestion. I still need to test it at scale, but for small values of ntasks it seems like

using Distributed, SlurmClusterManager

addprocs(SlurmManager(); topology=:master_worker)

runs as expected.


Have now tested this at scale, and yes, it fixes the problem.
