Addprocs_slurm not connecting to all available workers

I’m using Distributed and ClusterManagers to run large numbers of simulations in parallel on my university HPC cluster. Scheduling and job management on the cluster are handled by Slurm.

I initialise the parallel worker processes with the following:

using Distributed, ClusterManagers

available_workers = parse(Int, ENV["SLURM_NTASKS"])

addprocs_slurm(available_workers; topology = :master_worker)

However, my job times out in the addprocs_slurm step. Specifically, I schedule the job using sbatch with the following .sh file:

#!/bin/bash
# set the number of nodes.
#SBATCH --nodes=32
# set the number of tasks per node.
#SBATCH --ntasks-per-node=48
# set the amount of memory needed for each CPU.
#SBATCH --mem-per-cpu=8000
# set max wallclock time (hh:mm:ss).
#SBATCH --time=5:00:00
# set the time partition for the job. 
#SBATCH --partition=short
# set name of job (AND DATE!)
#SBATCH --job-name=my_job_name
# mail alert at start, end, and failure of execution
#SBATCH --mail-type=ALL
# send mail to this address
#SBATCH --mail-user=my_email_here
# run the application

module load Julia/1.8.2-linux-x86_64

julia requirements.jl
julia my_job.jl > my_job.log

requirements.jl is a file containing all the dependencies for my simulations, and looks like:

using Pkg 

dependencies = [# list of packages used here]

Pkg.add(dependencies)

my_job.jl contains the code shown at the top of my post, followed by the code to execute my simulations.

If I open my_job.log it looks like:

connecting to worker 1 out of 1536
connecting to worker 2 out of 1536
# 884 lines omitted
connecting to worker 887 out of 1536
# No more lines after here

So it looks like addprocs_slurm is getting stuck somewhere, and my job is timing out in that step while waiting to connect to all the workers.

I’m kind of at a loss as to where to begin debugging this. Does anyone have any suggestions?

*The Julia version is 1.8.2. This specific job is part of corrections for a paper I’m working on, whose previous simulations were also run in 1.8.2, so I’d rather not upgrade the version if I can avoid it.

Looks like this is an open issue with ClusterManagers.jl: Slurm broken · Issue #196 · JuliaParallel/ClusterManagers.jl · GitHub

Why are you using sbatch? What happens if you use SlurmManager directly inside julia? Something like

my_job.jl:

include("requirements.jl")

function my_long_running_function()
    # a single simulation goes here
end

and create a run.jl file with

using ClusterManagers, Distributed

# n_procs = number of workers to request, n_sims = number of simulations;
# both are placeholders to set for your job.
addprocs(SlurmManager(n_procs); topology = :master_worker)

@everywhere include("my_job.jl")

function run()
    results = pmap(1:n_sims) do x
        my_long_running_function()
    end
    return results
end
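You would then launch run.jl directly with julia rather than through sbatch, letting SlurmManager request the worker allocation itself via srun; whether you can do that from a login node or need an interactive/salloc session depends on your cluster’s policy.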

Why are you using sbatch

My cluster prefers the use of salloc and sbatch for requesting resources. Basically all of the compute nodes on the cluster are non-interactive, so I thought sbatch was more apt.

What happens if you use SlurmManager directly inside julia?

I haven’t tried this, but I’ll give it a go to see if it reproduces the error. To be honest, I’m struggling to reproduce the error at all (or at least to get an MWE). It has happened twice when I run large numbers of tasks, but I can’t trigger it with the smaller number of CPUs available on the debugging/interactive nodes.

If you use SlurmManager within Julia’s addprocs, it will internally construct and run an srun command, which is really the same as sbatch. Try it and let us know the results.
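As a rough sketch (if I remember the ClusterManagers.jl interface correctly, extra keyword arguments to addprocs are forwarded to srun as options; the worker count, partition name, and time limit below are placeholders):

using Distributed, ClusterManagers

# Request 48 workers via srun; partition and t are forwarded to srun
# as command-line options (placeholder values, adjust to your cluster).
addprocs(SlurmManager(48); partition = "short", t = "00:30:00", topology = :master_worker)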

Alternatively, for your workflow there is GitHub - kleinhenz/SlurmClusterManager.jl: julia package for running code on slurm clusters, which connects workers to already-allocated resources instead of allocating new ones.
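Usage is roughly the following (a sketch based on its README; the script has to run inside an existing allocation, e.g. one created by your sbatch script, and the worker count is read from the Slurm environment rather than passed explicitly):

using Distributed, SlurmClusterManager

# SlurmManager() picks up the number of tasks from the Slurm allocation
# (e.g. SLURM_NTASKS), so no explicit worker count is needed.
addprocs(SlurmManager(); topology = :master_worker)

println("connected to $(nworkers()) workers")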


Thanks for the suggestion. I still need to test it at scale, but for small values of ntasks it seems like

using Distributed, SlurmClusterManager

addprocs(SlurmManager(); topology=:master_worker)

runs as expected.


Have now tested this at scale, and yes, it fixes the problem.
