I’m looking to provide an explicit explanation of the simplest way to get a Julia program running in parallel across multiple nodes in a SLURM cluster (basically a minimal working example that illustrates the logic).
My impression so far is that there are two primary ways to run Julia in a SLURM cluster. Suppose I want to define a function and run it in a parallel for-loop on N cores distributed across M nodes in the cluster:
Option 1
The first option comes from this Stack Overflow post. Basically, you can use some functions from the ClusterManagers package in your code and then just run Julia as normal without having to explicitly write a SLURM script.
The example program:
# File name
# slurm_example.jl
using Distributed
using ClusterManagers
# Add N workers across M nodes
addprocs_slurm(N; nodes=M, exename="/path/to/julia/bin/julia")  # rest of SLURM kwargs go here...
# Define function
@everywhere function myFunction(args)
    # Code goes here...
end
# Run function K times in parallel; @sync waits for all iterations to finish
@sync @distributed for i = 1:K
    myFunction(args)
end
As I understand it, to run this program, I would simply execute
julia slurm_example.jl
from the command line while logged into the cluster. Then the addprocs_slurm function runs the rest of the Julia code as an interactive SLURM job, the equivalent of using srun with the specified SLURM options.
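To check the loop logic without a cluster, the same pattern can be tried with local workers. The sketch below is only my assumption of what such a test could look like: addprocs(2) stands in for addprocs_slurm, and where_am_i is a hypothetical stand-in for myFunction that reports which worker and host ran each iteration.

```julia
using Distributed

# Assumption: two local workers stand in for the SLURM-launched ones.
addprocs(2)

# Hypothetical stand-in for myFunction: report the iteration, worker id, and host.
@everywhere function where_am_i(i)
    return (i, myid(), gethostname())
end

# With a reducer (vcat), @distributed blocks until every iteration is done
# and collects the per-iteration results on the master process.
results = @distributed (vcat) for i in 1:8
    [where_am_i(i)]
end

for (i, id, host) in sort(results)
    println("iteration $i ran on worker $id on host $host")
end

# Shut the workers down once the work is finished.
rmprocs(workers())
```

If the workers really landed on different nodes of the cluster, the printed host names should differ between them; with local workers they will all be the same.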
Option 2
The second option, exemplified in this post, involves writing a SLURM script for a batch job calling Julia with the --machinefile flag. In this case, the example program is:
# File name
# slurm_example.jl
using Distributed
using ClusterManagers
# Define function
@everywhere function myFunction(args)
    # Code goes here...
end
# Run function K times in parallel; @sync waits for all iterations to finish
@sync @distributed for i = 1:K
    myFunction(args)
end
# Kill the workers
rmprocs(workers())
Then to run this, I would need to write and execute a separate SLURM script that looks something like this:
#!/bin/bash
#SBATCH --ntasks=N # N cores
#SBATCH --nodes=M # M nodes
# Rest of #SBATCH flags go here...
julia --machinefile=$SLURM_NODEFILE slurm_example.jl
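I am not sure the node file above is something SLURM writes for me, so here is a sketch of how the machine file might be generated by hand. It assumes Julia's documented machine-file line format, [count*][user@]host[:port], and a uniform number of tasks per node. The helper machinefile_lines and the host names node01 and node02 are hypothetical; in a real job the host names would presumably come from scontrol show hostnames $SLURM_JOB_NODELIST.

```julia
# Hypothetical helper (not part of Julia or ClusterManagers): build machine-file
# lines in the "count*host" form, one line per node.
function machinefile_lines(hosts::AbstractVector{<:AbstractString}, tasks_per_node::Integer)
    return ["$(tasks_per_node)*$(host)" for host in hosts]
end

# Assumption: made-up host names for illustration. Inside a real job they
# would come from: scontrol show hostnames $SLURM_JOB_NODELIST
hosts = ["node01", "node02"]
lines = machinefile_lines(hosts, 4)

# Write the lines to a temporary file that could then be passed to julia.
path = joinpath(mktempdir(), "machinefile")
write(path, join(lines, "\n") * "\n")

print(read(path, String))
```

Each line asks for that many worker processes on the named host, so two nodes with four tasks each would give eight workers in total.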
One thing I find confusing about this example is why I don’t need to do something like
addprocs(SlurmManager(N))
in the Julia code. Or do I? Are there any glaring errors in this code? Is the main difference between the two options just that Option 1 is an interactive SLURM job and Option 2 is a batch job?
Thanks ahead of time for any feedback.