Debugging possible issue with `machinefile` option on SLURM system

Trying to debug a problem that may be due to Julia not using a provided nodefile on a SLURM system. It appears that jobs are getting spun off on hosts where they shouldn’t be.

The SLURM batch script is:

#!/bin/bash

#SBATCH --mail-user=myemail@email.com
#SBATCH --mail-type=ALL  # Alerts sent when job begins, ends, or aborts
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --mem=100G
#SBATCH --job-name=indiv_array
#SBATCH --array=1-5
#SBATCH --time=03-00:00:00  # Wall Clock time (dd-hh:mm:ss) [max of 14 days]
#SBATCH --output=indiv_array_%A_%a.output  # output and error messages go to this file

export SLURM_NODEFILE=`generate_pbs_nodefile`

julia --machinefile $SLURM_NODEFILE indiv_array.jl

Does anyone (a) see anything wrong with this, and (b) have any suggestions for how I can check (from within the resulting Julia session) which nodes are actually being used, to see if they correspond to the ones listed in SLURM_NODEFILE? That is, is there a function I can print out that reports where Julia thinks it’s supposed to be running tasks?

You could use gethostname() or even run(`hostname`). However, my recommendation would be to take a look at:

https://github.com/JuliaParallel/ClusterManagers.jl

And use the slurm manager.
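Something along these lines (just a sketch, assuming ClusterManagers.jl is installed; the keyword arguments are forwarded to srun and will depend on your site’s configuration):

using Distributed, ClusterManagers   # on older Julia (0.6), addprocs lives in Base

# Launch one worker per SLURM task in the allocation; SLURM_NTASKS is set by sbatch/salloc.
# The extra keyword arguments are examples only; adjust partition/time for your cluster.
addprocs(SlurmManager(parse(Int, ENV["SLURM_NTASKS"])), partition="debug", t="00:30:00")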

Thanks @raminammour! gethostname() is perfect.

Re: cluster managers: is there something wrong with machinefile? I can’t find any notes about it being deprecated or anything. Seems like it should be pretty straightforward, aside from the fact that it isn’t working. :confused:

I’ve looked into the slurm manager, but I don’t think it’ll work on our system: it sounds like I need a running Julia instance to manage all the spin-off processes created by the srun execution. The problem is that I can only run code on a gateway system, which wouldn’t leave that original Julia instance running long enough to manage those spin-off processes.

What has worked on my system is requesting one node in interactive mode and using it to spawn all the other processes; it may even work in batch mode.

I used machinefile on an LSF system a long time ago, and it worked. However, keep in mind that you would be connecting to all of the nodes over SSH: you may run out of ports if the job is big enough, and it is slower anyway.
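If you do want to manage it from inside Julia rather than via the command-line flag, here is a rough sketch (assuming the nodefile path is exported as SLURM_NODEFILE, as in your script, and that passwordless SSH between nodes is set up, which --machinefile needs anyway):

using Distributed                          # Julia >= 0.7; on 0.6 addprocs is in Base

hosts = readlines(ENV["SLURM_NODEFILE"])   # one hostname per line, as in the generated file
addprocs(hosts)                            # launches one worker per entry over SSH, like --machinefile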

You’ll need to find out where it is on your system. Are you sure it’s located at generate_pbs_nodefile? What directory are you running this from?

If you start up an MPI job you can just look for where the file pops up and save that location. It can be different on each cluster.

I appreciate that, but I think that’s gonna get real complicated in our system (it’s a university system with long waits for access). Thus the use of sbatch. I’d have to (a) ask for an interactive session that would last longer than my task, then (b) when it comes up, request the resources for the real jobs and hope they become available in time.

Hmmm… I was told by the IT folks that that would execute and give the right path, but maybe they were sloppy when they gave me that and made implicit assumptions about where I was working from. I’ll investigate that.

@ChrisRackauckas

So generate_pbs_nodefile prints an absolute path (e.g. /tmp/oQ8vxZUMRR), and the file at that path looks like this:

[eubankn@vm-qa-node001 slurm]$ cat /tmp/oQ8vxZUMRR
vm-qa-node001

Seem appropriate?

@ChrisRackauckas Here’s a two-task, two-node version:

[eubankn@vm-qa-node001 slurm]$ salloc --partition=debug --nodes=2 --ntasks=2 --tasks-per-node=1
salloc: Granted job allocation 22594479
[eubankn@vm-qa-node001 slurm]$ generate_pbs_nodefile
/tmp/UUMuRd3rwj
[eubankn@vm-qa-node001 slurm]$ cat `generate_pbs_nodefile`
vm-qa-node001
vm-qa-node002

Yeah, that looks appropriate if the compute nodes you’re supposed to be using are named vm-qa-node001 and vm-qa-node002.

yeah… :confused: OK, well, thanks for confirming I’m not doing something super stupid! I have a reservation at 5pm to actually check whether the parallel jobs report those as their hostnames. Fingers crossed!
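(For reference, the check I’m planning to run is roughly this, assuming the workers come up from the machinefile as expected:)

using Distributed   # Julia >= 0.7; not needed on 0.6

for w in workers()
    println("worker ", w, " => ", remotecall_fetch(gethostname, w))   # node each worker reports
end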