I recently set up some scripts for running Julia jobs on a Slurm cluster. All of my jobs just use a single node with multiple CPUs. I’ll describe my approach, but I’m not an expert in HPC, so I’m not sure if everything that I’m doing is 100% correct.
My approach is to write two scripts: a Slurm script and a Julia script. I am currently not using ClusterManagers. My mental model is that if I request one node with multiple CPUs, Slurm allocates those CPUs on a single machine, and Julia will be able to use them just like it uses the cores on my laptop. So basically all I need to do is using Distributed; addprocs(4) and then parallelize my code with @distributed or pmap.
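Rather than hard-coding the worker count, it can be read from Slurm's environment. The sketch below is my own addition, not from any cluster documentation: it reads SLURM_CPUS_PER_TASK (set by Slurm when --cpus-per-task is used), falls back to 1 when the variable is unset (e.g. when testing on a laptop), and demonstrates pmap:

```julia
using Distributed

# Read the CPU count Slurm allocated to this task; fall back to 1
# when the variable is unset (e.g. outside a Slurm job).
ncpus = parse(Int, get(ENV, "SLURM_CPUS_PER_TASK", "1"))
addprocs(ncpus)

# pmap distributes the function calls across the workers.
squares = pmap(x -> x^2, 1:8)
println(squares)

# remove the workers
rmprocs(workers())
```

This way the same Julia script works unchanged if you later change --cpus-per-task in the Slurm script.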
Example 1
Slurm Script (“test_distributed.slurm”)
#!/bin/bash
#SBATCH -p <list of partition names>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:05:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your email address>
julia test_distributed.jl
Julia Script (“test_distributed.jl”)
using Distributed
# launch worker processes
addprocs(4)
println("Number of processes: ", nprocs())
println("Number of workers: ", nworkers())
# each worker reports its worker id, OS process id, and hostname
for i in workers()
    id, pid, host = fetch(@spawnat i (myid(), getpid(), gethostname()))
    println(id, " ", pid, " ", host)
end

# remove the workers
rmprocs(workers())
Output File
Number of processes: 5
Number of workers: 4
2 2331013 cn1081
3 2331015 cn1081
4 2331016 cn1081
5 2331017 cn1081
Example 2
In this example I run a parallel for loop with @distributed. The body of the loop contains a 5-minute sleep call. I verified that the loop iterations are in fact running in parallel by recording the run time for the whole job: it was 00:05:19, rather than the 00:20:00 that would be expected if the code were running serially.
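The same sanity check can also be done from inside Julia with @elapsed, without waiting on Slurm's accounting. This is just an illustration of the idea with a much shorter sleep than the real job uses:

```julia
using Distributed
addprocs(4)

# Time the parallel loop from inside Julia. With 4 workers and 4
# iterations, the elapsed time should be close to one sleep(2)
# (plus some startup overhead), not the 8 seconds a serial run
# would take.
t = @elapsed @sync @distributed for i in 1:4
    sleep(2)
end
println("elapsed: ", t, " s")

rmprocs(workers())
```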
Slurm Script (“test_distributed2.slurm”)
#!/bin/bash
#SBATCH -p <list of partition names>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:30:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your email address>
julia test_distributed2.jl
Julia Script (“test_distributed2.jl”)
using Distributed
addprocs(4)
println("Number of processes: ", nprocs())
println("Number of workers: ", nworkers())
@sync @distributed for i in 1:4
    sleep(300)
    id, pid, host = myid(), getpid(), gethostname()
    println(id, " ", pid, " ", host)
end

# remove the workers
rmprocs(workers())
Output File
Number of processes: 5
Number of workers: 4
From worker 2: 2 2334507 cn1081
From worker 3: 3 2334509 cn1081
From worker 5: 5 2334511 cn1081
From worker 4: 4 2334510 cn1081
Comments
If you’re using a Project.toml or Manifest.toml, you will probably need to call addprocs like this:
addprocs(4; exeflags="--project")
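One detail worth spelling out: --project only makes the project environment active on the workers; packages still have to be loaded on them explicitly with @everywhere. A minimal sketch, using the Statistics standard library as a stand-in for whatever your project actually depends on:

```julia
using Distributed

# "--project" makes each worker activate the same Project.toml
# environment as the master process (assuming the job is launched
# from the project directory).
addprocs(4; exeflags="--project")

# Loading a package on the master does not load it on the workers;
# @everywhere loads it on all processes.
@everywhere using Statistics

# each worker computes a mean, just to show the package is usable
results = pmap(_ -> mean(rand(100)), 1:4)
println(results)

rmprocs(workers())
```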
I also had to jump through some hoops to run a project that had dependencies in private GitHub repos. I think it boiled down to instantiating a Manifest.toml file that contained the appropriate links to the private GitHub repos, but I didn’t document the full process…