Issues with machinefile and SLURM

nickeubank · December 20, 2017, 4:47pm

Spent last week fighting with issues on a slurm cluster, and having finally figured it out, I wanted to share the result (and a warning):

I had been parallelizing through a slurm batch script with this call:

julia --machinefile $SLURM_NODEFILE indiv_array.jl

(For full script, see here)

It turns out (as the Vanderbilt IT team and I discovered) that the problem with this strategy is that, because julia opens new processes using ssh, they escape SLURM’s notice. As a result, my parallel workers were running outside of SLURM’s awareness (the technical term is, I believe, outside the cgroup), taking up memory unexpectedly and not always shutting down when scancel was called on the main task (at one point I apparently had >30 zombie processes running on the research cluster, even though my squeue was clean).
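One rough way to see whether a process is inside Slurm's control is to look at its cgroup membership. The sketch below assumes a Linux node where Slurm's cgroup plugins are enabled and where the cgroup path of a tracked process contains the string "slurm"; both are assumptions about the cluster configuration, and `in_slurm_cgroup` is a hypothetical helper name, not a Slurm command.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given cgroup file mentions a Slurm cgroup.
# Assumption: Slurm's cgroup plugins are in use, so tracked processes have
# "slurm" somewhere in their cgroup path.
in_slurm_cgroup() {
    local cgroup_file="${1:-/proc/self/cgroup}"
    grep -qi 'slurm' "$cgroup_file" 2>/dev/null
}

# Report on the current shell; a worker started over ssh would be expected
# to land in the "NOT inside" branch.
if in_slurm_cgroup; then
    echo "this shell is inside a Slurm cgroup"
else
    echo "this shell is NOT inside a Slurm cgroup"
fi
```

To check another process, pass its cgroup file, for example `in_slurm_cgroup /proc/<pid>/cgroup`.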

I think ClusterManagers.jl solves this, but it doesn’t seem to work well on busy slurm clusters (which generally require an sbatch script and long waits for resources), since it uses srun.

So… oops.

CC: @ChrisRackauckas @raminammour

vchuravy · December 20, 2017, 7:23pm

Using srun should be fine since that is the correct way of starting a job.

The workflow should work something like this:

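salloc or sbatch                 # request the allocation first (either command).
julia> using ClusterManagers     # provides SlurmManager (on Julia 0.7+, also `using Distributed`).
julia> addprocs(SlurmManager(2)) # SlurmManager should inherit the outside allocation.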

nickeubank · December 20, 2017, 7:41pm

OH! So an srun executed inside a slurm allocation doesn’t try to create a new allocation; it starts processes in that existing allocation?

vchuravy · December 20, 2017, 7:53pm

Yes. See Slurm Workload Manager - srun

Run a parallel job on cluster managed by Slurm. If necessary, srun will first create a resource allocation in which to run the parallel job.

srun in general is the right way of starting jobs within an allocation and crucially within sbatch.

https://slurm.schedmd.com/sbatch.html

When the job allocation is finally granted for the batch script, Slurm runs a single copy of the batch script on the first node in the set of allocated nodes.

That’s why an sbatch script usually has one or more srun commands in it.
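As a minimal sketch of that pattern: the batch script below runs once on the first allocated node, and the srun call inside it starts tasks within the allocation that sbatch already obtained. The job name, node count, task count, and time limit are illustrative values, not taken from this thread.

```shell
#!/bin/bash
#SBATCH --job-name=julia-workers   # hypothetical job name
#SBATCH --nodes=2                  # illustrative sizes; adjust for your cluster
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

main() {
    # Slurm runs this script once, on the first node of the allocation.
    echo "batch script running on $(uname -n)"
    # srun starts one copy of the command per task inside the existing
    # allocation; it does not request a new one.
    srun uname -n
}

# Only launch when Slurm has actually granted an allocation.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    main
else
    echo "not inside a Slurm allocation; submit this script with sbatch" >&2
fi
```

Submitted with sbatch, the srun line should print one node name per task, all tracked by Slurm and cleaned up by scancel.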

nickeubank · December 21, 2017, 5:42pm

This conversation is revelatory. Thank you!!

ChrisRackauckas · December 21, 2017, 5:46pm

Huh, I didn’t know that would work either. Awesome @vchuravy. One note: does the default addprocs() do the correct thing, like addprocs(SlurmManager(2)), when in a cluster job? What I mean is, does addprocs() automatically recognize that it should use the SlurmManager with 2 processes when it’s called from a SLURM job with 2 cores, or is that asking too much?

vchuravy · December 21, 2017, 6:08pm

That is asking too much. We would have to redefine addprocs when loading ClusterManagers.
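Since addprocs() will not pick the worker count up by itself, one workaround is to pass it explicitly from the job environment. The sketch below assumes the job was submitted with --ntasks, so that Slurm sets SLURM_NTASKS, and it assumes a Julia version where addprocs lives in the Distributed standard library. `slurm_addprocs_expr` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print a Julia expression that adds one worker per
# Slurm task. Fails when run outside an allocation or without --ntasks.
slurm_addprocs_expr() {
    local n="${SLURM_NTASKS:-}"
    if [ -z "${SLURM_JOB_ID:-}" ] || [ -z "$n" ]; then
        echo "not inside a Slurm allocation with --ntasks set" >&2
        return 1
    fi
    printf 'using Distributed, ClusterManagers; addprocs(SlurmManager(%d))\n' "$n"
}

# Possible use inside an sbatch script (not executed here):
#   julia -e "$(slurm_addprocs_expr); include(\"indiv_array.jl\")"
```

Inside a job submitted with --ntasks=4, the function would print `using Distributed, ClusterManagers; addprocs(SlurmManager(4))`, which the batch script can hand to julia -e before including the real program.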

vchuravy · December 21, 2017, 8:56pm

A post was split to a new topic: Issues running on a PBS cluster
