I’m looking for some guidance on how to get started with the following scenario. I have a cluster with multiple nodes, each with 48 cores, and job scheduling is handled by Slurm. Currently, I’ve been running single-node distributed jobs of the form (computation1.jl):
using Distributed
using SharedArrays
@everywhere include("setup1.jl"); # preprocessing and setup code
results = SharedArray{Float64}(n_samples);
@sync @distributed for i in 1:n_samples
    results[i] = f(i); # f is defined in setup1.jl
end
# save to disk
Which is to say, this is a perfectly parallelizable job with no interaction between iterations.
This is then launched with Slurm using the command sbatch job1.sh, where the job script looks like:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --mem=128GB
#SBATCH --cpus-per-task=1
...
julia -p 48 computation1.jl
This works as expected as a single-node, 48-process job.
But I would like to move to running on multiple nodes, to leverage the resources I have for larger jobs. I’ve looked a bit at ClusterManagers.jl and the Julia documentation, but I’m struggling to see how to modify both my Julia code and my Slurm script to make this work properly.
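To show where I’ve gotten to, here is a rough sketch of what I imagine the multi-node version (computation2.jl) might look like, based on my reading of the ClusterManagers.jl README. The SlurmManager usage, the switch from SharedArrays to pmap (since, as I understand it, a SharedArray only lives in one node’s shared memory), and the file name are all my guesses:
using Distributed
using ClusterManagers
# One worker per Slurm task across the whole allocation;
# SLURM_NTASKS is set by sbatch when --ntasks is given.
n_workers = parse(Int, ENV["SLURM_NTASKS"])
addprocs(SlurmManager(n_workers)) # workers are launched via srun inside the allocation
@everywhere include("setup1.jl"); # preprocessing and setup code, now on every worker
# SharedArrays do not span nodes, so collect results with pmap instead
results = pmap(f, 1:n_samples);
# save to disk
My guess is that the job script would then request something like --nodes=2 and --ntasks=96, and simply call julia computation2.jl without the -p flag, since the workers would be started from inside the allocation rather than by Julia at startup. Is that roughly the right approach, or am I missing something? Thanks for any help.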