This may be completely obvious and trivial to most people, but I constantly run into problems with the ClusterManagers.jl package on HPC systems. This is related to this discussion. The main reason is that the model of that package seems to be:
- go to login/submission node, start julia
- load ClusterManagers.jl, use a flavour of `addprocs_*` to submit a job and get a pool of worker processes in return
- do work interactively or launch script from that console
This never really works for me because
- the sysadmin hates me because I am doing computational work on the login node (master uses too many resources on that node), and I’m not supposed to.
- Logging on to a compute node to do the same as above does not work, because one often cannot submit jobs to the scheduler from a compute node.
As such, I haven’t found the ClusterManagers.jl approach feasible in a batch environment, because you have to keep a julia session running on the submission node. Hopefully I’m getting this totally wrong and you are going to tell me otherwise now.
Given that issue I’ve been using a hack mentioned in the above link for quite a while now, and here’s the ParallelTest.jl package that tries to formalize that a bit more. The idea is simple:
- submit a job to the scheduler that will run a julia script `script.jl` on a compute node.
- the job requests a certain specification (3 nodes with x GB ram, say). This specification is visible as ENV vars on the compute node
- `script.jl` calls the function `machines` in `ParallelTest`, which reads those ENV vars (i.e. node names), manually compiles a machine list, and calls `addprocs` on it. How exactly all of this works depends on the type of scheduler (SGE, slurm, PBS etc)
- only the slurm submit is tested for now.
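To make the idea above a bit more concrete, here is a minimal sketch of what such a script could look like for slurm. This is not the actual ParallelTest.jl code: the helpers `build_machine_list` and `slurm_hosts` and the count of 5 workers per node are made up for illustration. It assumes that `scontrol` is available on the compute node and that passwordless SSH between the allocated nodes works, since `addprocs` with a machine list launches workers over SSH.

```julia
using Distributed

# Hypothetical helper (not the ParallelTest API): pair every host with a
# worker count, in the (host, count) form that `addprocs` accepts.
function build_machine_list(hosts::Vector{String}, procs_per_node::Int)
    return [(h, procs_per_node) for h in hosts]
end

# Hypothetical helper: expand slurm's compressed node list
# (e.g. "magi[98-99]") into individual host names via `scontrol`.
function slurm_hosts()
    nodelist = ENV["SLURM_JOB_NODELIST"]
    return String.(split(readchomp(`scontrol show hostnames $nodelist`)))
end

function main()
    if !haskey(ENV, "SLURM_JOB_NODELIST")
        println("not inside a slurm job, nothing to do")
        return
    end
    # srun starts one copy of the script per task; only task 0 acts as master
    if get(ENV, "SLURM_PROCID", "0") != "0"
        println("I am not the master, so goodbye")
        return
    end
    # 5 workers per node is an arbitrary choice for this sketch
    addprocs(build_machine_list(slurm_hosts(), 5))
    for p in workers()
        host = remotecall_fetch(gethostname, p)
        println("worker number $p lives on $host")
    end
end

# Only run when executed as a script, not when included from elsewhere
if abspath(PROGRAM_FILE) == @__FILE__
    main()
end
```

The point of the design is that the master process itself runs inside the batch job, so nothing has to stay alive on the login node.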
slurm output
# run.slurm looks like this:
#!/bin/bash
#SBATCH --job-name=jltest
#SBATCH --output=partest.out
#SBATCH --error=partest.err
#SBATCH --partition=MISC
#SBATCH --nodes=2
srun -l julia script.jl
0: Started julia processes
1: Started julia processes
1: I am not the master, so goodbye
0: make everybody say hello
0: hi I am worker number 1, I live on magi98
0: 	From worker 3:	hi I am worker number 3, I live on magi99
0: 	From worker 8:	hi I am worker number 8, I live on magi99
0: 	From worker 9:	hi I am worker number 9, I live on magi99
0: 	From worker 11:	hi I am worker number 11, I live on magi99
0: 	From worker 10:	hi I am worker number 10, I live on magi99
0: 	From worker 2:	hi I am worker number 2, I live on magi98
0: 	From worker 5:	hi I am worker number 5, I live on magi98
0: 	From worker 4:	hi I am worker number 4, I live on magi98
0: 	From worker 7:	hi I am worker number 7, I live on magi98
0: 	From worker 6:	hi I am worker number 6, I live on magi98
0: make everybody do some math
0: 	From worker 2:	Hi, I am worker number 2 doing some math
0: 	From worker 5:	Hi, I am worker number 5 doing some math
0: 	From worker 7:	Hi, I am worker number 7 doing some math
0: 	From worker 4:	Hi, I am worker number 4 doing some math
0: 	From worker 3:	Hi, I am worker number 3 doing some math
0: 	From worker 6:	Hi, I am worker number 6 doing some math
0: 	From worker 11:	Hi, I am worker number 11 doing some math
0: 	From worker 9:	Hi, I am worker number 9 doing some math
0: 	From worker 10:	Hi, I am worker number 10 doing some math
0: 	From worker 8:	Hi, I am worker number 8 doing some math
0: serial call takes 17.646472431
0: parallel call takes 2.728280852
0:  quitting 
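For reference, a serial vs. parallel timing like the one at the end of that output could be produced roughly as follows. This is only a sketch: the workload `slow_square` is a made-up stand-in, not what the test script actually computes, and it assumes the workers have already been added (if there are none, `pmap` simply runs on the master).

```julia
using Distributed

# Assumption: stand-in workload, defined on the master and on all workers
@everywhere slow_square(x) = (sleep(0.01); x^2)

inputs = 1:20

# Serial version: everything runs on the master process
t0 = time()
serial = map(slow_square, inputs)
serial_time = time() - t0

# Parallel version: pmap hands the inputs out to the available workers
t0 = time()
parallel = pmap(slow_square, inputs)
parallel_time = time() - t0

println("serial call takes ", serial_time)
println("parallel call takes ", parallel_time)
```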
This is still a work in progress; comments/PRs welcome, and hopefully there’s an easier solution out there!