How to parallelize Julia across multiple nodes on an HPC cluster (Slurm)?

The parallel part of my Julia script uses the Distributed.jl package and the remotecall_fetch function. (I am not using MPI.) In short, it solves many independent linear programming problems in parallel. (Here is the sample code.) On my HPC cluster, each node has 40 CPUs with 2 hyper-threads per CPU. I was able to parallelize my code on a single node with 40*2=80 threads. Now I need to solve many more problems, so I want to parallelize across multiple nodes with more threads. How should I do this?
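
The structure is roughly like the following minimal sketch (solve_lp here is just a hypothetical placeholder for building and solving one Gurobi model, not my actual code):

using Distributed
addprocs(80)   # 80 workers on a single node

@everywhere function solve_lp(i)
    # stand-in for setting up and solving the i-th linear program
    return i^2
end

# the problems are independent, so each one is fetched from some worker
results = [remotecall_fetch(solve_lp, workers()[mod1(i, nworkers())], i) for i in 1:1000]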

I saw this post, but couldn’t replicate what people suggested there. Let’s say I need 80 CPUs from 2 nodes.

First option: as suggested by @ChrisRackauckas in his blog post, I use a machine file and write the Slurm script as follows:

#!/bin/bash
#SBATCH --job-name=xz49
#SBATCH --partition=interactive
#SBATCH --nodes=2
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=40
#SBATCH --mem-per-cpu=4G
#SBATCH --time=00:30:00

julia --machine-file $SLURM_JOB_NODELIST gurobi.jl

Then I got the following error:

ERROR: SystemError: opening file "/home/xz49/bc4u11n[1-2]": No such file or directory

In the example in his blog post, he has:

export SLURM_NODEFILE=`generate_pbs_nodefile`
./julia --machine-file $SLURM_NODEFILE /home/crackauc/test.jl

But I couldn’t find a way to generate the Slurm node file. Also, how should I make use of the hyper-threads in this case?
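
I wondered whether something along these lines could turn the compressed node list into a machine file, but I have not been able to verify it, and I am not sure whether repeating each host is the right way to use the hyper-threads:

# hypothetical sketch: expand SLURM_JOB_NODELIST into one "count*host" line
# per node (Julia's machine-file format), then hand that file to julia
NODEFILE=$(mktemp)
for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    echo "40*$host"    # or 80*$host if the hyper-threads should be used
done > "$NODEFILE"

julia --machine-file "$NODEFILE" gurobi.jl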

Second option: use the ClusterManagers.jl package. My understanding is that I do not need to write a Slurm script anymore. So, at the top of my Julia script, I add:

using ClusterManagers
addprocs_slurm(80, partition="interactive", time="00:30:00", mem_per_cpu="4G")

Then I launch my Julia script by typing julia test.jl in the shell, and I get the following error:

Error launching Slurm job:
ERROR: LoadError: TaskFailedException:
MethodError: no method matching replace(::String, ::String, ::String)

It seems that this is due to an unresolved bug in the package. Is there a way to work around it? Also, how should I make use of the hyper-threads in this case?
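
From the error message, my guess is that the package still calls the old two-argument form replace(s, old, new), which was removed in Julia 1.0; only the pair form is accepted now, e.g. in the REPL:

julia> replace("a,b", ",", " ")
ERROR: MethodError: no method matching replace(::String, ::String, ::String)

julia> replace("a,b", "," => " ")
"a b"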

Third option: I guess this is wrong, but I have tried the following Slurm script:

#SBATCH --nodes=2
#SBATCH --ntasks=80

But it seems that this runs the Julia script 80 times in parallel rather than parallelizing it across 80 cores.
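
I suspect this happens because each of the 80 tasks starts its own copy of the script when the job step is launched with srun, i.e. something like:

# hypothetical illustration: with --ntasks=80, this line starts 80 independent
# copies of test.jl (one per task) instead of one driver process with 80 workers
srun julia test.jl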


Any suggestions will be greatly appreciated.


I’d recommend you try the master branch of ClusterManagers instead of the registered version. They haven’t tagged a release in a while, and master has received several bug fixes. This works for me on Slurm.
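
If you haven’t dev’ed the package already, something like this in the package REPL should pull in master:

julia> ]
pkg> add ClusterManagers#master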

Hope I can help here. I saw you have an issue open regarding a license file error?

Do you happen to know the layout of the shared filesystems on your cluster?
By that I mean: which filesystems are mounted on the cluster master node and on all the compute nodes?

Hi, thanks for your help. I now get the following error, which was mentioned in this issue. Do you know how I might fix it?

Error launching Slurm job:
ERROR: LoadError: TaskFailedException:
KeyError: key "SLURM_JOB_ID" not found
Stacktrace:
 [1] (::Base.var"#459#460")(::String) at ./env.jl:79
 [2] access_env at ./env.jl:43 [inlined]
 [3] getindex at ./env.jl:79 [inlined]
 [4] launch(::SlurmManager, ::Dict{Symbol,Any}, ::Array{Distributed.WorkerConfig,1}, ::Base.GenericCondition{Base.AlwaysLockedST}) at /home/xz49/.julia/packages/ClusterManagers/P3lPY/src/slurm.jl:60
 [5] (::Distributed.var"#39#42"{SlurmManager,Dict{Symbol,Any},Array{Distributed.WorkerConfig,1},Base.GenericCondition{Base.AlwaysLockedST}})() at ./task.jl:358
Stacktrace:
 [1] wait at ./task.jl:267 [inlined]
 [2] addprocs_locked(::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:nodes, :partition, :time, :mem_per_cpu),Tuple{Int64,String,String,String}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/cluster.jl:494
 [3] addprocs(::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:nodes, :partition, :time, :mem_per_cpu),Tuple{Int64,String,String,String}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/cluster.jl:441
 [4] #addprocs_slurm#17 at /home/xz49/.julia/packages/ClusterManagers/P3lPY/src/slurm.jl:103 [inlined]
 [5] top-level scope at /home/xz49/gurobi.jl:6
 [6] include(::Module, ::String) at ./Base.jl:377
 [7] exec_options(::Base.JLOptions) at ./client.jl:288
 [8] _start() at ./client.jl:484
in expression starting at /home/xz49/gurobi.jl:6

Hi, thanks. Sorry, I am not sure what you mean. The only relevant information I found is NFS and Lustre.

As a workaround, you can just change the relevant lines in the package source to get it to work for you. Run

julia> ]
pkg> dev ClusterManagers

and open up the source (usually ~/.julia/dev/ClusterManagers/src/slurm.jl). Remove the lines that use the job ID (comment out line 60 and remove the references to jobid on lines 61 and 66). This should get it to work, but unfortunately you won’t be able to run multiple Slurm jobs in the same directory.

On my cluster the variable SLURM_JOB_ID is set once a job is submitted (even when I’m running an interactive terminal on a compute node). However, this may not be the case everywhere. Hopefully the issue is resolved soon.
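
A quick way to check on your cluster is to start Julia inside the job allocation and look for the variable, something like:

julia> haskey(ENV, "SLURM_JOB_ID")   # true here, since the variable is set inside my jobs
true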

Thanks again. Does that mean I cannot submit job arrays?

Also, I found (possibly) another workaround using salloc. I type the following at the command line:

salloc --partition=interactive --nodes=2 --time=00:30:00 --ntasks-per-node=10 --mem-per-cpu=4G

Then the following messages pop up:

salloc: Pending job allocation 819214
salloc: job 819214 queued and waiting for resources
salloc: job 819214 has been allocated resources
salloc: Granted job allocation 819214
salloc: Waiting for resource configuration
salloc: Nodes bc4u11n[2-3] are ready for job

Then I run my Julia script with

julia test.jl

At the top of my test.jl, I have

using ClusterManagers
addprocs_slurm(20, nodes=2, partition="interactive", time="00:30:00", mem_per_cpu="4G")

which is consistent with what I reserved via salloc.

If this works for you then great; in fact, I use something similar on my cluster. I haven’t tried job arrays, but I guess they will work, since they all have the same job ID. Jobs with different IDs will simply overwrite each other’s files and run into issues. If you are able to get the current master branch to work, then that’s perfect.

Thank you. NFS is usually used for your /home directory.
Lustre will be a larger and faster filesystem and is usually referred to as scratch space.

By doing so, I got 20 output files, one for each of the 20 CPUs across the 2 nodes. Do you happen to know how I should change the settings so that only one output file is generated?

This is the expected behavior. Each worker writes its output to its own file, so, e.g., anything printed to stdout is redirected to that worker’s file. You ideally don’t want all of it sent to the same file, since separate files let you look at the output from each worker individually.

If you want to concatenate them all eventually, you may do so in bash using something like cat job*.out >> job.out.

I see, thanks.