Issues with machinefile and SLURM

I spent last week fighting with issues on a slurm cluster and, having finally figured them out, wanted to share the result (and a warning):

I had been parallelizing through a slurm batch script with this call:

julia --machinefile $SLURM_NODEFILE indiv_array.jl

(For full script, see here)

It turns out (as the Vanderbilt IT team and I discovered) that the problem with this strategy is that, because julia opens new processes using ssh, they escape SLURM’s notice. As a result, my parallel workers were running outside of SLURM’s awareness (the technical term is, I believe, outside the cgroup), taking up memory unexpectedly and not always shutting down when scancel was called on the main task (at one point I apparently had >30 zombie processes running on the research cluster, even though my squeue was clean).

I think ClusterManagers.jl solves this, but it doesn’t seem to work well on busy slurm clusters (which generally require an sbatch script and long waits for resources), since it uses srun.

So… oops.

CC: @ChrisRackauckas @raminammour

Using srun should be fine since that is the correct way of starting a job.

The workflow should work something like this:

salloc   # or sbatch: create resources.
julia> using Distributed, ClusterManagers
julia> addprocs(SlurmManager(2)) # SlurmManager should inherit the outside allocation.

OH! So an srun executed inside a slurm allocation doesn’t try to create a new allocation; it starts processes in the existing allocation?

Yes. See Slurm Workload Manager - srun

Run a parallel job on cluster managed by Slurm. If necessary, srun will first create a resource allocation in which to run the parallel job.

srun in general is the right way of starting jobs within an allocation and crucially within sbatch.

When the job allocation is finally granted for the batch script, Slurm runs a single copy of the batch script on the first node in the set of allocated nodes.

That’s why an sbatch script usually has one or several srun commands in it.
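
As a rough illustration of that last point, a batch script along these lines keeps everything inside the allocation. This is only a sketch: the job name, the node and task counts, and the launch_step helper are hypothetical, and the guard on SLURM_JOB_ID (which Slurm sets inside a job) is there only so the script falls back to a dry run when executed outside Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=julia-workers   # hypothetical job name
#SBATCH --nodes=2                  # hypothetical resource request
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

# Slurm runs this script once, on the first allocated node.
# launch_step is an illustrative helper, not a Slurm command.
launch_step() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # Inside an allocation: srun starts the command as a job step
        # in the existing allocation (one copy per task).
        srun "$@"
    else
        # Outside Slurm: only report what would have been run.
        echo "not inside a Slurm allocation; would run: srun $*"
    fi
}

launch_step hostname
```

Submitted with sbatch, each launch_step call becomes a job step that Slurm tracks, so scancel on the job also stops those processes.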

This conversation is revelatory. Thank you!!

Huh, I didn’t know that would work either. Awesome @vchuravy. One note: does default addprocs() do the correct thing like addprocs(SlurmManager(2)) when in a cluster job? What I mean is, does addprocs() automatically recognize that it should use the SlurmManager with 2 processes when it’s called from a SLURM job with 2 cores, or is that asking too much?

That is asking too much. We would have to redefine addprocs when loading ClusterManagers.
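
Since the worker count has to be passed explicitly, one way to avoid hard-coding it is to read it from the job environment. A minimal sketch, assuming the job was submitted with --ntasks so that Slurm sets SLURM_NTASKS; the worker_count helper and the idea of handing the value to the Julia script as an argument are illustrative only, not something the package provides:

```shell
#!/bin/bash
# Print the number of workers to request.
# SLURM_NTASKS is assumed to be set by Slurm (job submitted with --ntasks);
# fall back to 1 when it is unset, e.g. outside a job.
worker_count() {
    echo "${SLURM_NTASKS:-1}"
}

echo "requesting $(worker_count) worker(s)"

# Hypothetical use inside the batch script: pass the count to Julia, e.g.
#   julia indiv_array.jl "$(worker_count)"
# and have the script call addprocs(SlurmManager(parse(Int, ARGS[1]))).
```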
