Turns out (the Vanderbilt IT team and I have discovered) the problem with this strategy is that because Julia opens new processes using ssh, they escape SLURM's notice. As a result, my parallel workers were running outside of SLURM's awareness (the technical term is, I believe, outside the cgroup), taking up memory unexpectedly and not always shutting down when scancel was called on the main task (at one point I apparently had >30 zombie processes running on the research cluster, even though my squeue was clean).
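Concretely, the pattern was something like the sketch below (the node names and worker counts are just placeholders for whatever the job was allocated); addprocs with a machine spec launches the workers over ssh:

```julia
using Distributed

# Workers started from a machine spec are plain ssh-launched processes,
# so they land outside the SLURM job's cgroup and scancel never sees them.
# "node01"/"node02" are placeholder hostnames.
addprocs([("node01", 8), ("node02", 8)])
```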
I think ClusterManagers.jl solves this, but it doesn't seem to work well for busy SLURM clusters (which generally require an sbatch script and long waits for resources), since it uses srun.
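For reference, the srun-based usage I mean is something like this sketch (the partition name and time limit are placeholders); when this is called outside an existing allocation, the srun it spawns sits in the queue waiting for resources:

```julia
using Distributed, ClusterManagers

# SlurmManager launches the workers via srun, so they stay inside the
# job's allocation/cgroup; extra keyword arguments are passed through
# as srun flags (here: --partition and -t).
addprocs(SlurmManager(2), partition="debug", t="00:10:00")
```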
When the job allocation is finally granted for the batch script, Slurm runs a single copy of the batch script on the first node in the set of allocated nodes.
That’s why an sbatch script usually has one or several srun commands in it.
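If it helps to see this from the Julia side, here is a minimal sketch of a script that each srun task could run (the filename is hypothetical; SLURM_PROCID and SLURM_NTASKS are the standard SLURM variables). srun spreads it across the allocation, while anything written directly in the batch script only runs on that first node:

```julia
# probe.jl -- launch under srun inside an sbatch allocation, e.g.
#   srun julia probe.jl
# Each srun task prints its SLURM rank and the node it landed on.
rank   = get(ENV, "SLURM_PROCID", "not under srun")
ntasks = get(ENV, "SLURM_NTASKS", "?")
println("task $rank of $ntasks on $(gethostname())")
```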
Huh, I didn’t know that would work either. Awesome @vchuravy. One note: does the default addprocs() do the right thing, like addprocs(SlurmManager(2)), when called inside a cluster job? What I mean is: does addprocs() automatically recognize that it should use SlurmManager with 2 processes when it's called from a SLURM job with 2 cores, or is that asking too much?
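(For what it's worth, the manual version of what I'm asking about would be a check along these lines; SLURM_JOB_ID and SLURM_NTASKS are the standard environment variables, and this is just a sketch of the dispatch I'd like addprocs() to do for me.)

```julia
using Distributed, ClusterManagers

# Sketch of the auto-detection I'm asking about: inside a SLURM job, use
# SlurmManager with as many workers as tasks were allocated (assuming
# SLURM_NTASKS is set); otherwise fall back to plain local workers.
if haskey(ENV, "SLURM_JOB_ID")
    addprocs(SlurmManager(parse(Int, ENV["SLURM_NTASKS"])))
else
    addprocs()
end
```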