Running Julia in a SLURM Cluster

I recently set up some scripts for running Julia jobs on a Slurm cluster. All of my jobs just use a single node with multiple CPUs. I’ll describe my approach, but I’m not an expert in HPC, so I’m not sure if everything that I’m doing is 100% correct.

My approach is to write two scripts: a Slurm script and a Julia script. I'm not currently using ClusterManagers.jl. My mental model is that if I request one node with multiple CPUs, Slurm allocates those cores on a single machine, and Julia can use them just as it would the cores on my laptop. So basically all I need to do is using Distributed; addprocs(4) and then parallelize my code with @distributed or pmap.
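
Before the full cluster examples, here's a minimal local sketch of the pmap variant of that pattern (my illustration, not part of the Slurm jobs below):

```julia
using Distributed

addprocs(2)  # on the cluster this would match --cpus-per-task

# pmap farms the calls out to the workers and collects the results in order
results = pmap(x -> x^2, 1:8)
println(results)

rmprocs(workers())
```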

Example 1

Slurm Script (“test_distributed.slurm”)

#!/bin/bash

#SBATCH -p <list of partition names>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:05:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your email address>

julia test_distributed.jl

Julia Script (“test_distributed.jl”)

using Distributed

# launch worker processes
addprocs(4)

println("Number of processes: ", nprocs())
println("Number of workers: ", nworkers())

# ask each worker for its id, process id, and hostname
for i in workers()
    id, pid, host = fetch(@spawnat i (myid(), getpid(), gethostname()))
    println(id, " " , pid, " ", host)
end

# remove the workers
rmprocs(workers())

Output File

Number of processes: 5
Number of workers: 4
2 2331013 cn1081
3 2331015 cn1081
4 2331016 cn1081
5 2331017 cn1081

Example 2

In this example I run a parallel for loop with @distributed. The body of the loop contains a 5-minute sleep call. I verified that the iterations do run in parallel by checking the wall-clock time for the whole job: it was 00:05:19, rather than the roughly 00:20:00 you'd expect if the four iterations ran serially.

Slurm Script (“test_distributed2.slurm”)

#!/bin/bash

#SBATCH -p <list of partition names>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:30:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your email address>

julia test_distributed2.jl

Julia Script (“test_distributed2.jl”)

using Distributed

addprocs(4)

println("Number of processes: ", nprocs())
println("Number of workers: ", nworkers())

@sync @distributed for i in 1:4
    sleep(300)
    id, pid, host = myid(), getpid(), gethostname()
    println(id, " " , pid, " ", host)
end

rmprocs(workers())

Output File

Number of processes: 5
Number of workers: 4
      From worker 2:	2 2334507 cn1081
      From worker 3:	3 2334509 cn1081
      From worker 5:	5 2334511 cn1081
      From worker 4:	4 2334510 cn1081
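
To convince yourself of the same thing locally without a 5-minute wait, a shorter sleep timed with @elapsed shows the same effect (this snippet is my illustration, not part of the original job):

```julia
using Distributed

addprocs(4)

# four 2-second sleeps spread across four workers should take
# roughly 2 seconds, not roughly 8, if they really run in parallel
t = @elapsed @sync @distributed for i in 1:4
    sleep(2)
end
println("elapsed: ", t, " s")

rmprocs(workers())
```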

Comments

If you’re using a Project.toml or Manifest.toml, you will probably need to call addprocs like this so the workers pick up the same project environment:

addprocs(4; exeflags="--project")
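
One refinement I've found useful (my own variation, not required): instead of hard-coding the worker count, read it from SLURM_CPUS_PER_TASK, the environment variable Slurm sets inside the allocation, so the Julia script always matches --cpus-per-task:

```julia
using Distributed

# SLURM_CPUS_PER_TASK is set by Slurm inside the job;
# fall back to 1 for runs outside the cluster
n = parse(Int, get(ENV, "SLURM_CPUS_PER_TASK", "1"))
addprocs(n; exeflags="--project")
```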

I also had to jump through some hoops to run a project that had dependencies in private GitHub repos. I think it boiled down to instantiating a Manifest.toml that contained the appropriate links to those private repos, but I didn’t document the full process…
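
For reference, the instantiation step itself is just the following (assuming the Manifest.toml already pins the private repos and your Git credentials, e.g. an SSH key, work on the cluster):

```julia
using Pkg

# activate the project directory and install exactly what the Manifest pins
Pkg.activate(".")
Pkg.instantiate()
```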
