[ANN] Julia in parallel batch mode: job schedulers, etc.

This may be completely obvious and trivial to most people, but I constantly run into problems with the ClusterManagers.jl package on HPC systems. This is related to this discussion. The main reason is that the package's model seems to be:

  1. Go to the login/submission node and start Julia.
  2. Load ClusterManagers.jl and use a flavour of addprocs_* to submit a job and get a pool of worker processes in return.
  3. Do work interactively, or launch a script from that console.

This never really works for me because:

  1. The sysadmin hates me because I am doing computational work on the login node (the master process uses too many resources on that node), and I’m not supposed to.
  2. Logging on to a compute node to do the same as above does not work, because one often cannot submit jobs to the scheduler from a compute node.

As such, I haven’t found the ClusterManagers.jl approach to be feasible in a batch environment, because you have to keep a Julia session running on the submission node. Hopefully I’m getting this totally wrong and you are going to tell me otherwise now.

Given that issue, I’ve been using a hack mentioned in the above link for quite a while now, and here’s the ParallelTest.jl package, which tries to formalize it a bit more. The idea is simple:

  1. Submit a job to the scheduler that will run a Julia script, script.jl, on a compute node.
  2. The job requests a certain specification (3 nodes with x GB of RAM, say). This specification is visible as environment variables on the compute node.
  3. script.jl calls the function machines in ParallelTest, which reads those environment variables (i.e. node names), manually compiles a machine list, and calls addprocs on it. How exactly all of this works depends on the type of scheduler (SGE, Slurm, PBS, etc.).
  4. Only the Slurm submission is tested for now.
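
To make step 3 more concrete, here is a minimal sketch of how a machine list could be assembled from Slurm's environment. It is not the code in ParallelTest; the helper names `cpus_per_node`, `machine_list` and `add_slurm_workers` are made up for illustration. It assumes that the job exports `SLURM_JOB_NODELIST` and `SLURM_JOB_CPUS_PER_NODE`, that `scontrol` is on the PATH of the compute node, and that passwordless SSH between the allocated nodes works, since `addprocs` with a machine list starts workers over SSH.

```julia
using Distributed

# Expand Slurm's compact per-node CPU string, e.g. "6(x2),4" -> [6, 6, 4].
function cpus_per_node(spec::AbstractString)
    counts = Int[]
    for part in split(spec, ',')
        m = match(r"^(\d+)(?:\(x(\d+)\))?$", strip(part))
        m === nothing && error("unexpected SLURM_JOB_CPUS_PER_NODE entry: $part")
        n = parse(Int, m.captures[1])
        reps = m.captures[2] === nothing ? 1 : parse(Int, m.captures[2])
        append!(counts, fill(n, reps))
    end
    return counts
end

# Pair each host with its worker count, in the (host, count) form that
# Distributed.addprocs accepts for SSH-launched workers.
function machine_list(hosts, cpus)
    length(hosts) == length(cpus) || error("host and CPU lists differ in length")
    return [(String(h), n) for (h, n) in zip(hosts, cpus)]
end

# Hypothetical entry point, to be called from script.jl inside a Slurm job.
# Assumption: `scontrol show hostnames` expands a list such as "magi[98-99]"
# into one hostname per line.
function add_slurm_workers()
    nodelist = ENV["SLURM_JOB_NODELIST"]
    hosts = readlines(`scontrol show hostnames $nodelist`)
    cpus = cpus_per_node(ENV["SLURM_JOB_CPUS_PER_NODE"])
    return addprocs(machine_list(hosts, cpus))
end
```

Only `add_slurm_workers` needs a running Slurm job; the two helpers are plain functions and can be tried in any Julia session, for example `machine_list(["magi98", "magi99"], cpus_per_node("5(x2)"))`. Other schedulers expose the allocation differently, so the parsing part would have to be swapped out for SGE or PBS.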

slurm output

# run.slurm looks like this:

#!/bin/bash
#SBATCH --job-name=jltest
#SBATCH --output=partest.out
#SBATCH --error=partest.err
#SBATCH --partition=MISC
#SBATCH --nodes=2

srun -l julia script.jl

# partest.out then contains (srun -l prefixes each line with the task number):

0: Started julia processes
1: Started julia processes
1: I am not the master, so goodbye
0: make everybody say hello
0: hi I am worker number 1, I live on magi98
0: 	From worker 3:	hi I am worker number 3, I live on magi99
0: 	From worker 8:	hi I am worker number 8, I live on magi99
0: 	From worker 9:	hi I am worker number 9, I live on magi99
0: 	From worker 11:	hi I am worker number 11, I live on magi99
0: 	From worker 10:	hi I am worker number 10, I live on magi99
0: 	From worker 2:	hi I am worker number 2, I live on magi98
0: 	From worker 5:	hi I am worker number 5, I live on magi98
0: 	From worker 4:	hi I am worker number 4, I live on magi98
0: 	From worker 7:	hi I am worker number 7, I live on magi98
0: 	From worker 6:	hi I am worker number 6, I live on magi98
0: make everybody do some math
0: 	From worker 2:	Hi, I am worker number 2 doing some math
0: 	From worker 5:	Hi, I am worker number 5 doing some math
0: 	From worker 7:	Hi, I am worker number 7 doing some math
0: 	From worker 4:	Hi, I am worker number 4 doing some math
0: 	From worker 3:	Hi, I am worker number 3 doing some math
0: 	From worker 6:	Hi, I am worker number 6 doing some math
0: 	From worker 11:	Hi, I am worker number 11 doing some math
0: 	From worker 9:	Hi, I am worker number 9 doing some math
0: 	From worker 10:	Hi, I am worker number 10 doing some math
0: 	From worker 8:	Hi, I am worker number 8 doing some math
0: serial call takes 17.646472431
0: parallel call takes 2.728280852
0:  quitting 
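
The output shows two srun tasks, of which only task 0 carries on as master. A rough sketch of how a script could make that decision and run the hello round is below. It is an illustration rather than the actual script.jl: it assumes that srun sets `SLURM_PROCID` for each task, and the names `is_master`, `hello_lines` and `main` are hypothetical. Unlike the run above, the greeting is assembled on the master from each process's hostname instead of being printed by the workers themselves.

```julia
using Distributed

# Assumption: srun exports SLURM_PROCID ("0", "1", ...) to every task.
# Outside of Slurm the variable is unset, so the process counts as master.
is_master(env = ENV) = get(env, "SLURM_PROCID", "0") == "0"

# Ask every process (master included) for its hostname and build one line each.
function hello_lines()
    return ["hi I am worker number $w, I live on $(remotecall_fetch(gethostname, w))"
            for w in procs()]
end

function main()
    println("Started julia processes")
    if !is_master()
        println("I am not the master, so goodbye")
        return
    end
    # Workers would be added here, e.g. with addprocs on a machine list.
    println("make everybody say hello")
    foreach(println, hello_lines())
end
```

Calling `main()` in a plain Julia session without workers prints a single hello line for process 1, which is a quick way to check the logic before submitting a job.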

Still a work in progress; comments/PRs welcome, and hopefully there’s an easier solution out there!

I’ve been able to acquire the resources first and then call addprocs(SlurmManager(8)) from the compute node, as described in "Issues with machinefile and SLURM - #2 by vchuravy". It finds my existing resources automatically. Does that work for you?

As it happens, no, that doesn’t work on my system: the call always times out. But note that this is only one problem. The main problem described above is that this approach implies computational load on the login node, which is undesirable in general.