Using a machine file on a cluster but also propagating environment to the remote workers

Hi, I am trying to run some distributed code on a Slurm cluster, using the sbatch file below:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=28
#SBATCH --mem-per-cpu=4571
#SBATCH --time=24:00:00
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR

# Create the machine file for Julia
JULIA_MACHINEFILE=machinefile-${SLURM_JOB_ID}
srun bash -c hostname > $JULIA_MACHINEFILE
sed -i 's/^[[:alnum:]]*/&-ib/g' $JULIA_MACHINEFILE

module purge
module load Julia/1.4.1-linux-x86_64

julia --machine-file ${JULIA_MACHINEFILE} \
    src/creditcard/run.jl \
    --dataset uci_heart \
    --label target \
    --epsilon 6.0 \
    --folds 5 \
    --distributed
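For context, the sed line appends "-ib" to each hostname so that Julia connects to the workers over the InfiniBand interface. The same idea can be sketched in Julia (the hostnames here are made up):

```julia
# Hypothetical hostnames, mimicking the lines that `srun hostname` writes
hosts = ["cnode41", "cnode41", "cnode42"]

# Same idea as the sed command above: append "-ib" to the leading
# alphanumeric run of each hostname, targeting the InfiniBand interface
ib_hosts = [replace(h, r"^[[:alnum:]]+" => m -> m * "-ib") for h in hosts]
```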

My code looks like:

using Pkg; Pkg.activate(".")
using Distributed
using ArgParse
using ForwardDiff
using LinearAlgebra
using CSV
using DataFrames
using AdvancedHMC
...

@everywhere begin
    using Pkg; Pkg.activate(".")
    using Distributed
    using ArgParse
    using ForwardDiff
    using LinearAlgebra
    using CSV
    using DataFrames
    using AdvancedHMC
    ...
end

MAIN BODY OF CODE

However, when I submit my batch job, it always crashes in a way that suggests the environment is not being set up properly on the workers:

ERROR: LoadError: On worker 2:
ArgumentError: Package ArgParse [c7e460c6-2fb9-53a9-8c5b-16f535851c63] is required but does not seem to be installed:
 - Run `Pkg.instantiate()` to install all recorded dependencies.

_require at ./loading.jl:998
require at ./loading.jl:927
#1 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/Distributed.jl:78
#101 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:290
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:79
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:88
#94 at ./task.jl:358

...and 27 more exception(s).

Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:316
 [2] macro expansion at ./task.jl:335 [inlined]
 [3] _require_callback(::Base.PkgId) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/Distributed.jl:75
 [4] #invokelatest#1 at ./essentials.jl:712 [inlined]
 [5] invokelatest at ./essentials.jl:711 [inlined]
 [6] require(::Base.PkgId) at ./loading.jl:930
 [7] require(::Module, ::Symbol) at ./loading.jl:922
 [8] include(::Module, ::String) at ./Base.jl:377
 [9] exec_options(::Base.JLOptions) at ./client.jl:288
 [10] _start() at ./client.jl:484

I’d guess this is some problem with setting Julia paths etc. across the remote workers; I’d appreciate some help with this.

What happens when you change

using Pkg; Pkg.activate(".")

to

using Pkg; Pkg.activate("."); Pkg.instantiate()

in both places?
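Concretely, the suggested pattern would look something like this (a sketch, assuming the job is launched from the project directory):

```julia
using Distributed, Pkg

# Activate and instantiate on the master process first...
Pkg.activate(".")
Pkg.instantiate()

# ...and then on every worker, so each remote process resolves the same
# project environment before any package is loaded
@everywhere begin
    using Pkg
    Pkg.activate(".")
    Pkg.instantiate()
end
```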

I’m not sure that it is related to your issue, but I found that I have to set the environment variable JULIA_PROJECT to path/to/environment for things to work properly with multiple workers on the SLURM cluster I’m using.
This should make the Pkg.activate(".") call redundant.

I have tried that; no change in the output, unfortunately.

Interesting, how do you set that environment variable? Should I do export JULIA_PROJECT=path/to/env on the login node, or something else?

I’m doing it as part of the sbatch script, it should be set on the compute node. But I’m not using machine files so maybe it’s different?
I just start the processes from within the Julia script with

using ClusterManagers
addprocs(SlurmManager(nworkers))

There is also the --project=path/to/environment flag that you can try using instead; I don’t remember if I tried that on my cluster. (There are some older discussions here about workers and environment activation.)
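A slightly fuller sketch: addprocs accepts an exeflags keyword whose contents are passed to each worker’s julia invocation, which is another way to propagate the project. Plain local workers stand in here for SlurmManager from ClusterManagers; the keyword is handled by Distributed either way:

```julia
using Distributed

# Propagate the master's active project to each worker via --project.
# With ClusterManagers you would call
#   addprocs(SlurmManager(nworkers); exeflags = ...)
# instead; local workers are used here as a stand-in.
project = something(Base.active_project(), "@.")
addprocs(2; exeflags = "--project=$project")
@assert nworkers() == 2

rmprocs(workers())  # clean up the demo workers
```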

I’m by no means a Slurm expert, but why do you need the machine file in your script? Wouldn’t the process start by default on the compute node when you send this script to the queue via sbatch?

As I understand it, the machine file tells Julia to start a worker process on each listed host of the allocation. In any case, this sbatch script results in the same errors as before:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=28
#SBATCH --mem-per-cpu=4571
#SBATCH --time=24:00:00
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR

# Create the machine file for Julia
JULIA_MACHINEFILE=machinefile-${SLURM_JOB_ID}
srun bash -c hostname > $JULIA_MACHINEFILE
sed -i 's/^[[:alnum:]]*/&-ib/g' $JULIA_MACHINEFILE

export JULIA_PROJECT=/home/dcs/csrxgb/julia_stuff/Project.toml

module purge
module load Julia/1.4.1-linux-x86_64

julia \
    --machine-file ${JULIA_MACHINEFILE} \
    src/creditcard/run.jl \
    --dataset uci_heart \
    --label target \
    --epsilon 6.0 \
    --folds 5 \
    --distributed

I will now try with your suggestion to use ClusterManagers

There is something different at least:

[login1 julia_stuff]$ cat slurm.cnode43.250472.err
ERROR: LoadError: LoadError: UndefVarError: @spawnat not defined
Stacktrace:
 [1] top-level scope
 [2] include(::Module, ::String) at ./Base.jl:377
 [3] exec_options(::Base.JLOptions) at ./client.jl:288
 [4] _start() at ./client.jl:484
in expression starting at /gpfs/home/dcs/csrxgb/julia_stuff/src/creditcard/run.jl:5
in expression starting at /gpfs/home/dcs/csrxgb/julia_stuff/src/creditcard/run.jl:4
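(For context: @spawnat is exported by Distributed, so this error typically means Distributed was not loaded before the macro was used. A minimal sketch of the required ordering:)

```julia
using Distributed   # must come before any use of @spawnat / @everywhere

addprocs(1)         # start one local worker process

# run myid() on worker 2 and fetch the result back on the master
fut = @spawnat 2 myid()
@assert fetch(fut) == 2
```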

Never mind, that was me not loading Distributed early enough; the code works now! Amazing. The only problem I have now is tracking progress through my distributed loop. I can see output from each process in a jobXXXX.out file, but I wrapped my main loop like so:

if distributed

    println("Distributing work...")
    p = Progress(total_steps)

    progress_pmap(1:total_steps, progress=p) do i

        CODE

    end
end

However, I cannot see this progress bar in the error output or in the job output. Any ideas?

Where is progress_pmap from? I use ProgressMeter.jl for progress reporting, but I actually only use it when running stuff interactively. Maybe you can try that package?

progress_pmap is from ProgressMeter.jl as well; I have tried both and neither seems to work. The strange thing is that I don’t seem to get any output or errors in the files defined in my sbatch script until the programme terminates / crashes: it is running successfully right now, but none of the printlns during my experiment setup have been recorded in those files either.

Maybe you need to call flush somewhere to have the output written to the file? After the job ends, you do see all the output from the println statements and so on?

Yep, I do. I have a main function that calls stuff like load_data, sets some constants, splits the data up etc., with some printlns in between, so I am wondering: if these don’t show up, maybe neither will any loading bars?

Maybe try explicitly printing to a file from within Julia and calling flush(file_io)? I think you can also redirect the output of the progress bar when you create it with Progress, but I’m not really sure.
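A minimal sketch of the explicit-file approach (the file name progress.log is made up; ProgressMeter’s Progress also takes an output IO keyword if you want to redirect the bar itself):

```julia
# Hypothetical log file; flushed after every write so the progress
# lines are visible on disk while the job is still running
open("progress.log", "w") do io
    for i in 1:5
        println(io, "finished step $i")
        flush(io)   # push buffered output to disk immediately
    end
end
```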

Ok will give these things a go, thank you very much for the help!
