I’m using Distributed and ClusterManagers to run large numbers of simulations in parallel on my university HPC cluster. Scheduling and job management on the cluster is handled by Slurm.
I initialise each of the parallel processes with the following:
using Distributed, ClusterManagers
available_workers = parse(Int, ENV["SLURM_NTASKS"])
addprocs_slurm(available_workers; topology = :master_worker)
However, my job times out in the addprocs_slurm step. Specifically, I schedule my job using sbatch and the following .sh file:
#!/bin/bash
# set the number of nodes.
#SBATCH --nodes=32
# set the number of CPUs required.
#SBATCH --ntasks-per-node=48
# set the amount of memory needed for each CPU.
#SBATCH --mem-per-cpu=8000
# set max wallclock time (hh:mm:ss).
#SBATCH --time=5:00:00
# set the time partition for the job.
#SBATCH --partition=short
# set name of job (AND DATE!)
#SBATCH --job-name=my_job_name
# mail alert at start, end, and abortion of execution
#SBATCH --mail-type=ALL
# send mail to this address
#SBATCH --mail-user=my_email_here
# run the application
module load Julia/1.8.2-linux-x86_64
julia requirements.jl
julia my_job.jl > my_job.log
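As a sanity check on the request above, the number of tasks Slurm should export as SLURM_NTASKS is the node count times the tasks per node. A minimal sketch of that arithmetic follows; the values are copied from the #SBATCH lines, the variable names are only illustrative, and the per-node memory figure assumes one CPU per task (Slurm's default) with --mem-per-cpu given in megabytes.

```shell
#!/bin/bash
# Values copied from the #SBATCH directives in the script above.
nodes=32
ntasks_per_node=48
mem_per_cpu_mb=8000

# Total tasks requested, i.e. the number of workers addprocs_slurm is asked to start.
total_tasks=$((nodes * ntasks_per_node))

# Memory requested per node, assuming one CPU per task (no --cpus-per-task given).
mem_per_node_mb=$((ntasks_per_node * mem_per_cpu_mb))

echo "total tasks: ${total_tasks}"
echo "memory per node (MB): ${mem_per_node_mb}"
```

So the job asks for 1536 workers in total and 384000 MB of memory on each node.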
requirements.jl is a file containing all the dependencies for my simulations, and looks like:
using Pkg
dependencies = [#= list of packages used here =#]
Pkg.add(dependencies)
my_job.jl contains the code shown at the top of my post, followed by the code to execute my simulations.
If I open my_job.log it looks like:
connecting to worker 1 out of 1536
connecting to worker 2 out of 1536
# 884 lines omitted
connecting to worker 887 out of 1536
# No more lines after here
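To see how far the connection phase got without scrolling through the log, the last worker index can be pulled out with standard tools. This is only a sketch: the function name last_connected_worker is hypothetical, and it assumes every progress line has exactly the "connecting to worker N out of M" form shown above.

```shell
#!/bin/bash
# Hypothetical helper: print the highest worker index found in a log file
# whose progress lines look like "connecting to worker N out of M".
last_connected_worker() {
    # sed keeps only N from matching lines; sort -n | tail picks the largest.
    sed -n 's/^connecting to worker \([0-9][0-9]*\) out of [0-9][0-9]*$/\1/p' "$1" \
        | sort -n | tail -n 1
}

# Only run against the real log if it exists in the current directory.
if [ -f my_job.log ]; then
    echo "last worker reached: $(last_connected_worker my_job.log)"
fi
```

For the log above this would report 887, meaning 649 of the 1536 workers were never reached before the job hit its time limit.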
So it looks like addprocs_slurm is getting stuck somehow, and my job is timing out in that step while waiting to connect to all the workers.
I’m kind of at a loss as to where to begin debugging this. Does anyone have any suggestions?
Note: the version of Julia is 1.8.2. This specific job is part of corrections for a paper I’m working on, the previous simulations for which were also run in 1.8.2, so I’d rather not upgrade the version if I can avoid it.