Trying to debug a problem that may be due to julia not using a provided nodefile on a SLURM system. It appears that jobs are getting spun off on hosts where they shouldn’t be.
The SLURM batch script is:
#!/bin/bash
#SBATCH --mail-user=myemail@email.com
#SBATCH --mail-type=ALL # Alerts sent when job begins, ends, or aborts
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --mem=100G
#SBATCH --job-name=indiv_array
#SBATCH --array=1-5
#SBATCH --time=03-00:00:00 # Wall Clock time (dd-hh:mm:ss) [max of 14 days]
#SBATCH --output=indiv_array_%A_%a.output # output and error messages go to this file
export SLURM_NODEFILE=`generate_pbs_nodefile`
julia --machinefile $SLURM_NODEFILE indiv_array.jl
Does anyone (a) see anything wrong with this, and (b) have any suggestions for how I can check (from within the resulting Julia session) which nodes are actually being used, to see if they correspond to the ones generated in SLURM_NODEFILE? I.e., is there a function I can print out that reports where julia thinks it’s supposed to be running tasks?
You could use gethostname() or even run(`hostname`). However, my recommendation would be to take a look at https://github.com/JuliaParallel/ClusterManagers.jl and use the slurm manager.
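For instance, a minimal sketch along these lines (run from the master process, assuming the workers were already attached via --machinefile) would report which host each worker actually landed on:

using Distributed  # on Julia 0.6 and earlier these functions live in Base

# Ask the master and every attached worker to report the machine it is running on.
println("master: ", gethostname())
for p in workers()
    println("worker ", p, ": ", remotecall_fetch(gethostname, p))
end

If the printed hostnames don’t match the contents of SLURM_NODEFILE, the machinefile isn’t being honored.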
Thanks @raminammour! gethostname() is perfect.
Re: cluster managers: is there something wrong with machinefile? I can’t find any notes about it being deprecated or anything. Seems like it should be pretty straightforward, aside from the fact it isn’t working.
I’ve looked into the slurm manager, but I don’t think it’ll work on our system: it sounds like I need a running Julia instance to manage all the spin-off processes created by the srun execution. The problem is I can only run code on a gateway system, which wouldn’t leave that original julia instance running long enough to manage those spin-off processes.
What has worked on my system is requesting one node in interactive mode (or maybe even in batch mode) and using it to spawn all the other processes.
I have used machinefile on an lsf system a long time ago, and it used to work. However, keep in mind that you would be using SSH to connect to all of the nodes: you may run out of ports if you have a big enough job, and it is slower anyway.
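If you do end up being able to try the ClusterManagers.jl route, the basic pattern is roughly this (a sketch only; the partition and time values are placeholders, and the exact keyword arguments forwarded to srun depend on your ClusterManagers.jl version):

using Distributed        # in Base on Julia 0.6 and earlier
using ClusterManagers

# Launch 16 workers through Slurm’s srun; keyword arguments are passed
# along to srun as flags (these particular values are just placeholders).
addprocs(SlurmManager(16), partition = "debug", t = "00:30:00")

# Sanity check: list the hosts the workers actually ended up on.
hosts = [remotecall_fetch(gethostname, p) for p in workers()]
println(unique(hosts))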
You’ll need to find out where it is on your system. Are you sure it’s located at generate_pbs_nodefile? What directory are you running this from?
If you start up an MPI job you can just look for where the file pops up and save that location. It can be different on each cluster.
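One more sanity check that can be done from inside the Julia session itself: since the batch script exports SLURM_NODEFILE, the path and its contents can be inspected directly (a sketch; it only assumes the export line shown above ran in the same job):

# Print the nodefile path exported by the batch script and its contents.
nodefile = get(ENV, "SLURM_NODEFILE", "")
if isfile(nodefile)
    println("nodefile at ", nodefile, " lists: ", readlines(nodefile))
else
    println("no nodefile found at '", nodefile, "'")
end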
I appreciate that, but I think that’s gonna get real complicated in our system (it’s a university system with long waits for access), thus the use of sbatch. I’d have to (a) ask for an interactive session that would last longer than my task, then (b) when it comes up, request the resources for the real jobs and hope they become available in time.
Hmmm… I was told by the IT folks that that would execute and give the right path, but maybe they were sloppy when they gave me that and made implicit assumptions about where I was working from. I’ll investigate that.
@ChrisRackauckas So generate_pbs_nodefile returns an absolute path (e.g. /tmp/oQ8vxZUMRR), and the file it points to looks like this:
[eubankn@vm-qa-node001 slurm]$ cat /tmp/oQ8vxZUMRR
vm-qa-node001
Seem appropriate?
@ChrisRackauckas Here’s a two-task, two-node version:
[eubankn@vm-qa-node001 slurm]$ salloc --partition=debug --nodes=2 --ntasks=2 --tasks-per-node=1
salloc: Granted job allocation 22594479
[eubankn@vm-qa-node001 slurm]$ generate_pbs_nodefile
/tmp/UUMuRd3rwj
[eubankn@vm-qa-node001 slurm]$ cat `generate_pbs_nodefile`
vm-qa-node001
vm-qa-node002
Yeah, that looks appropriate if the compute nodes you’re supposed to be using are named vm-qa-node001 and vm-qa-node002.
Yeah… OK, well, thanks for confirming I’m not doing something super stupid! I have a reservation at 5pm to actually check whether parallel jobs report those as their hostnames. Fingers crossed!