Distributed computing over SLURM array

Hi,

New to Julia, and also new to SLURM. I’m thinking about the following workflow for my project and would like to check whether it seems a sensible strategy, especially considering the specifics of the Julia language.

I plan to run hundreds (thousands…) of completely independent numerical simulations with varying parameters, where each simulation takes anywhere between a few minutes and a few hours to complete. I have access to a SLURM high-performance cluster with a very large number of CPUs, yet I can request at most about 64 of them per task (although I can of course run several of these 64-CPU tasks at the same time), since most machines have fewer processors than that.

Given this 64-CPU limit, I am not planning to rely exclusively on Julia’s native multiprocessing modules, since that would mean running many batches of 64 simulations serially. I think SLURM job arrays (i.e., using something like #SBATCH --array=1-1000) would be better suited for the job: my main Julia computing script would be called X times, and these processes would be distributed among the nodes, for a total far greater than 64 simultaneous tasks.

Does this sound like an optimal strategy given my project? Anything I have overlooked? Or are there other strategies I could consider native to Julia?

Thanks.


Could you give an example of how SLURM arrays work? I use SLURM quite often but through ClusterManagers.jl.


If you have a submit script like this:

myscript.sh

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --array=1-30

module load julia
julia myfile.jl

and you run sbatch myscript.sh, 30 identical jobs are submitted, and each job can be identified by its SLURM_ARRAY_TASK_ID environment variable.

At the top of myfile.jl you can put:
id = Base.parse(Int, ENV["SLURM_ARRAY_TASK_ID"])

so that every run of your Julia file has a unique number it can use to set up parameters etc.
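To tie this back to the original question (hundreds of simulations with varying parameters), here is a minimal sketch of what myfile.jl could look like; run_simulation and the parameter grid below are just made-up placeholders:

# myfile.jl (sketch only): run_simulation and the parameter values are placeholders
using Serialization

# One SLURM array task = one parameter combination
id = parse(Int, ENV["SLURM_ARRAY_TASK_ID"])

# The full parameter grid; it must be built identically in every array task
grid = vec([(alpha = a, beta = b) for a in 0.1:0.1:1.0, b in 1:100])

params = grid[id]    # submit with --array=1-length(grid), here --array=1-1000

result = run_simulation(; params...)

# Write each task's output to its own file so tasks never collide
serialize("result_$(id).jls", result)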

This is how I’ve always submitted serial jobs to the cluster. I don’t have the expertise to answer your question @tb877, but I have used array jobs to submit thousands of simulations successfully. It’s what everyone in my office does (but we are all just copying each other). I’m not sure whether submitting the jobs with --cpus-per-task=64 ends up making better use of resources because you use every CPU on the node. Perhaps this has some effect on your scheduling priority over the long term. But I’ve never had any problems just using array jobs.


Wonderful. I’ll have to pass one less command-line argument to the julia script :wink:

Right, well then I guess my strategy makes sense. I’ll have to try it to find out :slight_smile:

@affans: I have seen a few posts about ClusterManagers.jl (and read the docs) but not a lot of MWEs. My workflow is pretty simple (i.e., no communication between jobs, etc.), so I guess it shouldn’t be too complicated. Would you mind sharing a few lines of code about your strategy using this package?

When you submit julia myfile.jl 30 individual times, you are launching a fresh Julia instance each time, and all the code has to be compiled per instance. That is not a big deal if your simulation is long-running, but still.

Here is how I use ClusterManagers to manage everything within Julia (i.e., no bash files).

using ClusterManagers, Distributed
addprocs(SlurmManager(500), N=16, verbose="", topology=:master_worker, exeflags="--project=$(Base.active_project())")

This adds 500 worker processes over N = 16 nodes (my cluster has 32 CPUs per node). I also pass an exeflags argument so that the workers activate the correct project environment.
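One step that is easy to miss: whatever function you call inside pmap (run_model() below) has to be defined on the workers as well, typically with @everywhere after addprocs. A small sketch, where model.jl is a hypothetical file name:

using Distributed

# Make the simulation code available on every worker; model.jl (which would
# define run_model) is a placeholder file name here
@everywhere include("model.jl")

# Any packages the simulation uses must also be loaded on the workers, e.g.
# @everywhere using Random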

Then it’s a matter of running pmap:

cd = pmap(1:nsims) do x
        @info "Starting simulation $x on host $(gethostname()), id: $(myid())"
        run_model() 
end

The variable cd is an array containing the output of each run_model() call. If nsims = 2000, then pmap automatically distributes the 2000 simulations over the 500 workers I’ve requested, so at most 500 run at any one time. I then run my averaging/plotting scripts on cd.
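To connect this to the varying-parameters use case from the original post, pmap can also iterate over a parameter grid directly instead of 1:nsims, and the post-processing can happen right afterwards. A sketch, assuming a made-up parameter grid and that run_model accepts these parameters and returns something that can be averaged:

using Statistics

# Hypothetical parameter grid; one entry per simulation
params = vec([(alpha = a, beta = b) for a in 0.1:0.1:1.0, b in 1:100])

results = pmap(params) do p
    @info "Running alpha=$(p.alpha), beta=$(p.beta) on $(gethostname())"
    run_model(p)       # assumes run_model accepts a NamedTuple of parameters
end

# Everything is still in memory, so averaging/plotting can happen right away
avg = mean(results)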


This is exactly the kind of thing I was “worried” about (I even started writing along these lines in my original post but ended up not including it). My simulations are indeed usually long (several minutes to a few hours) so I think I’m okay here, but good call.

Thanks for the example code. Is there any advantage to setting up the SLURM job through Julia? It seems more complicated than through a shell script, to be honest. I suppose I’ll start with a shell script but keep this in mind once I’m more familiar with Julia, as it would eliminate the need for a shell script altogether.

I think I am confused. Do you submit one job to the scheduler, requesting 16 nodes with all 32 CPUs per node? And then ClusterManagers handles what resources you have available? But you said no bash files, so you don’t have a submission script that requests the resources from the SLURM scheduler, e.g. nodes=16, cpus-per-node=32?

Other than compilation time, is there an advantage? Does requesting all the CPUs on a node have advantages? Also, if your jobs go through together, then the compilation happens in parallel, so it isn’t too bad. But if you have to request 16 full nodes, is this not harder on the scheduler, since you need 16 nodes to be free? Whereas if you submit an array job, your jobs can start as soon as any node becomes free? I think I must have how this works wrong in my head.

Yes, it’s one job. addprocs() + SLURM allocates the 500 CPUs to me over 16 nodes. Here is the output of a smaller addprocs() call (50 workers over 15 nodes), which invokes srun internally:

julia> using ClusterManagers, Distributed

julia> addprocs(SlurmManager(50), N=15, verbose="", topology=:master_worker)
[ Info: Starting SLURM job julia-88628: `srun -J julia-88628 -n 50 -D /home/affans -N 15 --verbose -o /home/affans/./julia-88628-17267677895-%4t.out /home/affans/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/julia --worker=gYTQcR6SrovUPs3d`
srun: defined options for program `srun'
srun: --------------- ---------------------
srun: user           : `affans'
srun: uid            : 1005
srun: gid            : 1003
srun: cwd            : /home/affans
srun: ntasks         : 50 (set)
srun: nodes          : 15 (set)
srun: jobid          : 4294967294 (default)
srun: partition      : default
srun: profile        : `NotSet'
srun: job name       : `julia-88628'
srun: reservation    : `(null)'
srun: burst_buffer   : `(null)'
srun: wckey          : `(null)'
srun: cpu_freq_min   : 4294967294
srun: cpu_freq_max   : 4294967294
srun: cpu_freq_gov   : 4294967294
srun: switches       : -1
srun: wait-for-switches : -1
srun: distribution   : unknown
srun: cpu_bind       : default
srun: mem_bind       : default
srun: verbose        : 1
srun: slurmd_debug   : 0
srun: immediate      : false
srun: label output   : false
srun: unbuffered IO  : false
srun: overcommit     : false
srun: threads        : 60
srun: checkpoint_dir : /var/slurm/checkpoint
srun: wait           : 0
srun: nice           : -2
srun: account        : (null)
srun: comment        : (null)
srun: dependency     : (null)
srun: exclusive      : false
srun: bcast          : false
srun: qos            : (null)
srun: constraints    :
srun: geometry       : (null)
srun: reboot         : yes
srun: rotate         : no
srun: preserve_env   : false
srun: network        : (null)
srun: propagate      : NONE
srun: prolog         : (null)
srun: epilog         : (null)
srun: mail_type      : NONE
srun: mail_user      : (null)
srun: task_prolog    : (null)
srun: task_epilog    : (null)
srun: multi_prog     : no
srun: sockets-per-node  : -2
srun: cores-per-socket  : -2
srun: threads-per-core  : -2
srun: ntasks-per-node   : -2
srun: ntasks-per-socket : -2
srun: ntasks-per-core   : -2
srun: plane_size        : 4294967294
srun: core-spec         : NA
srun: power             :
srun: remote command    : `/home/affans/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/julia --worker=gYTQcR6SrovUPs3d'
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
[ Info: Worker 0 (after 0 s): No output file "/home/affans/./julia-88628-17267677895-0000.out" yet
srun: Nodes node[001-010,012-016] are ready for job
srun: jobid 10058: nodes(15):`node[001-010,012-016]', cpu counts: 64(x15)
srun: launching 10058.0 on host node001, 4 tasks: [0-3]
srun: launching 10058.0 on host node002, 4 tasks: [4-7]
srun: launching 10058.0 on host node003, 4 tasks: [8-11]
srun: launching 10058.0 on host node004, 4 tasks: [12-15]
srun: launching 10058.0 on host node005, 4 tasks: [16-19]
srun: launching 10058.0 on host node006, 3 tasks: [20-22]
srun: launching 10058.0 on host node007, 3 tasks: [23-25]
srun: launching 10058.0 on host node008, 3 tasks: [26-28]
srun: launching 10058.0 on host node009, 3 tasks: [29-31]
srun: launching 10058.0 on host node010, 3 tasks: [32-34]
srun: launching 10058.0 on host node012, 3 tasks: [35-37]
srun: launching 10058.0 on host node013, 3 tasks: [38-40]
srun: launching 10058.0 on host node014, 3 tasks: [41-43]
srun: launching 10058.0 on host node015, 3 tasks: [44-46]
srun: launching 10058.0 on host node016, 3 tasks: [47-49]
srun: route default plugin loaded
srun: Node node012, 3 tasks started
srun: Node node016, 3 tasks started
srun: Node node004, 4 tasks started
srun: Node node007, 3 tasks started
srun: Node node005, 4 tasks started
srun: Node node001, 4 tasks started
srun: Node node010, 3 tasks started
srun: Node node013, 3 tasks started
srun: Node node008, 3 tasks started
srun: Node node014, 3 tasks started
srun: Node node003, 4 tasks started
srun: Node node002, 4 tasks started
srun: Node node006, 3 tasks started
srun: Node node015, 3 tasks started
srun: Node node009, 3 tasks started

Here is what squeue reports:

[affans@hpc ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             10058      defq julia-88   affans  R       1:05     15 node[001-010,012-016]

Running 10 “simulations” looks like this:

julia> pmap(1:10) do x
        gethostname()
       end
10-element Vector{String}:
 "node002"
 "node003"
 "node002"
 "node007"
 "node005"
 "node005"
 "node004"
 "node004"
 "node005"
 "node002"

where each “simulation” simply returned the node on which the function was run.

@tb877 The advantage here is that you are fully contained within Julia: as your simulations finish, you can process them right away while they are still in memory. But as @p_f says, my way does require all the requested nodes to be available. On the other hand, I don’t really have much experience with SLURM arrays, but maybe I can play around with that later on.


Thanks @affans, that is very clear!

I will definitely come back to it when I am further along in the project.

I see! This is very interesting and good to be aware of.

I haven’t quite worked out how this would work with the cluster I’m on, as the cluster administrators are constantly warning “Do not run jobs on the login node”. Obviously the actual work here isn’t being performed on the login node, but I’m not sure what their criteria are for killing code that is running on login nodes. I think using up lots of memory on the login node might be a no-no, but I’m not sure.

Thanks for the write up!

Well, if SLURM is configured properly, the worker processes that Julia launches should always be on the compute nodes. I have a similar problem in that my head node has only 32 GB of RAM while the compute nodes have 128 GB. This means I have to be careful when all the data is sent back from the compute nodes to the head node (i.e., when pmap finishes running).
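One workaround for that (just a sketch, not something tested here) is to have each worker write its full output to shared storage and send only a small summary back to the head node. This assumes nsims and run_model from the earlier example, and a placeholder path on a shared filesystem:

using Distributed, Serialization

results_dir = "/path/on/shared/filesystem"   # placeholder; must be visible to every compute node

summaries = pmap(1:nsims) do i
    out = run_model()                                        # run_model as in the earlier example
    serialize(joinpath(results_dir, "sim_$(i).jls"), out)    # the full output stays on disk
    (id = i, bytes = Base.summarysize(out))                  # only this small tuple travels back
end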