Debugging possible issue with `machinefile` option on SLURM system

Trying to debug a problem that may be due to Julia not using a provided nodefile on a SLURM system. It appears that jobs are getting spun off on hosts where they shouldn’t be.

The SLURM batch script is:

#!/bin/bash

#SBATCH --mail-user=myemail@email.com
#SBATCH --mail-type=ALL  # Alerts sent when job begins, ends, or aborts
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --mem=100G
#SBATCH --job-name=indiv_array
#SBATCH --array=1-5
#SBATCH --time=03-00:00:00  # Wall Clock time (dd-hh:mm:ss) [max of 14 days]
#SBATCH --output=indiv_array_%A_%a.output  # output and error messages go to this file

export SLURM_NODEFILE=`generate_pbs_nodefile`

julia --machinefile $SLURM_NODEFILE indiv_array.jl

Does anyone (a) see anything wrong with this, and (b) have any suggestions for how I can check (from within the resulting Julia session) which nodes are actually being used, to see if they correspond to the ones listed in SLURM_NODEFILE? That is, is there a function I can print out that reports where Julia thinks it’s supposed to be running tasks?

You could use gethostname() or even run(`hostname`). However, my recommendation would be to take a look at:

https://github.com/JuliaParallel/ClusterManagers.jl

And use the slurm manager.
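Something along these lines (just a sketch, assuming ClusterManagers.jl is installed; the keyword arguments are forwarded to srun and will depend on your site’s configuration):

using Distributed, ClusterManagers   # on older Julia (0.6), addprocs lives in Base

# Launch one worker per SLURM task in the allocation; SLURM_NTASKS is set by sbatch/salloc.
# The extra keyword arguments are examples only; adjust partition/time for your cluster.
addprocs(SlurmManager(parse(Int, ENV["SLURM_NTASKS"])), partition="debug", t="00:30:00")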

Thanks @raminammour! gethostname() is perfect.

Re: cluster managers: is there something wrong with machinefile? I can’t find any notes about it being deprecated or anything. Seems like it should be pretty straightforward, aside from the fact that it isn’t working. :confused:

I’ve looked into the slurm manager, but I don’t think it’ll work on our system: it sounds like I need a running Julia instance to manage all the spin-off processes created by the srun execution. The problem is that I can only run code on a gateway system, which wouldn’t leave that original Julia instance running long enough to manage those spin-off processes.

What has worked on my system is requesting one node in interactive mode and using it to spawn all the other processes; it may even work in batch mode.

I used machinefile on an LSF system a long time ago, and it worked. However, keep in mind that you would be connecting to all of the nodes over SSH: you may run out of ports if the job is big enough, and it is slower anyway.
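If you do want to manage it from inside Julia rather than via the command-line flag, here is a rough sketch (assuming the nodefile path is exported as SLURM_NODEFILE, as in your script, and that passwordless SSH between nodes is set up, which --machinefile needs anyway):

using Distributed                          # Julia >= 0.7; on 0.6 addprocs is in Base

hosts = readlines(ENV["SLURM_NODEFILE"])   # one hostname per line, as in the generated file
addprocs(hosts)                            # launches one worker per entry over SSH, like --machinefile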

You’ll need to find out where it is on your system. Are you sure it’s located at generate_pbs_nodefile? What directory are you running this from?

If you start up an MPI job you can just look for where the file pops up and save that location. It can be different on each cluster.

I appreciate that, but I think that’s gonna get real complicated in our system (it’s a university system with long waits for access). Thus the use of sbatch. I’d have to (a) ask for an interactive session that would last longer than my task, then (b) when it comes up, request the resources for the real jobs and hope they become available in time.

Hmmm… I was told by the IT folks that that would execute and give the right path, but maybe they were sloppy when they gave me that and made implicit assumptions about where I was working from. I’ll investigate that.

@ChrisRackauckas

So generate_pbs_nodefile prints an absolute path (e.g. /tmp/oQ8vxZUMRR), and the file at that path looks like this:

[eubankn@vm-qa-node001 slurm]$ cat /tmp/oQ8vxZUMRR
vm-qa-node001

Seem appropriate?

@ChrisRackauckas Here’s a two-task, two-node version:

[eubankn@vm-qa-node001 slurm]$ salloc --partition=debug --nodes=2 --ntasks=2 --tasks-per-node=1
salloc: Granted job allocation 22594479
[eubankn@vm-qa-node001 slurm]$ generate_pbs_nodefile
/tmp/UUMuRd3rwj
[eubankn@vm-qa-node001 slurm]$ cat `generate_pbs_nodefile`
vm-qa-node001
vm-qa-node002

Yeah, that looks appropriate if the compute nodes you’re supposed to be using are named vm-qa-node001 and vm-qa-node002.

yeah… :confused: OK, well, thanks for confirming I’m not doing something super stupid! I have a reservation at 5pm to actually check whether the parallel jobs report those as their hostnames. Fingers crossed!
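(For reference, the check I’m planning to run is roughly this, assuming the workers come up from the machinefile as expected:)

using Distributed   # Julia >= 0.7; not needed on 0.6

for w in workers()
    println("worker ", w, " => ", remotecall_fetch(gethostname, w))   # node each worker reports
end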