Using a machine file on a cluster but also propagating environment to the remote workers

Hi, I am trying to run some distributed code on a Slurm cluster, using the sbatch file below:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=28
#SBATCH --mem-per-cpu=4571
#SBATCH --time=24:00:00
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR

# Create the machine file for Julia
JULIA_MACHINEFILE=machinefile-${SLURM_JOB_ID}
srun bash -c hostname > $JULIA_MACHINEFILE
sed -i 's/^[[:alnum:]]*/&-ib/g' $JULIA_MACHINEFILE

module purge
module load Julia/1.4.1-linux-x86_64

julia --machine-file ${JULIA_MACHINEFILE} \
    src/creditcard/run.jl \
    --dataset uci_heart \
    --label target \
    --epsilon 6.0 \
    --folds 5 \
    --distributed
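For context, the sed line appends "-ib" to each hostname so that Julia connects to the workers over the InfiniBand interface. The same idea can be sketched in Julia (the hostnames here are made up):

```julia
# Hypothetical hostnames, mimicking the lines that `srun hostname` writes
hosts = ["cnode41", "cnode41", "cnode42"]

# Same idea as the sed command above: append "-ib" to the leading
# alphanumeric run of each hostname, targeting the InfiniBand interface
ib_hosts = [replace(h, r"^[[:alnum:]]+" => m -> m * "-ib") for h in hosts]
```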

My code looks like:

using Pkg; Pkg.activate(".")
using Distributed
using ArgParse
using ForwardDiff
using LinearAlgebra
using CSV
using DataFrames
using AdvancedHMC
...

@everywhere begin
    using Pkg; Pkg.activate(".")
    using Distributed
    using ArgParse
    using ForwardDiff
    using LinearAlgebra
    using CSV
    using DataFrames
    using AdvancedHMC
    ...
end

MAIN BODY OF CODE

However, when I submit my batch job, it always crashes in a way that suggests the environment is not being set up properly on the workers:

ERROR: LoadError: On worker 2:
ArgumentError: Package ArgParse [c7e460c6-2fb9-53a9-8c5b-16f535851c63] is required but does not seem to be installed:
 - Run `Pkg.instantiate()` to install all recorded dependencies.

_require at ./loading.jl:998
require at ./loading.jl:927
#1 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/Distributed.jl:78
#101 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:290
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:79
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:88
#94 at ./task.jl:358

...and 27 more exception(s).

Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:316
 [2] macro expansion at ./task.jl:335 [inlined]
 [3] _require_callback(::Base.PkgId) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/Distributed.jl:75
 [4] #invokelatest#1 at ./essentials.jl:712 [inlined]
 [5] invokelatest at ./essentials.jl:711 [inlined]
 [6] require(::Base.PkgId) at ./loading.jl:930
 [7] require(::Module, ::Symbol) at ./loading.jl:922
 [8] include(::Module, ::String) at ./Base.jl:377
 [9] exec_options(::Base.JLOptions) at ./client.jl:288
 [10] _start() at ./client.jl:484

I’d guess this is some problem with setting Julia paths etc. across the remote workers; I’d appreciate some help with this.

What happens when you change

using Pkg; Pkg.activate(".")

to

using Pkg; Pkg.activate("."); Pkg.instantiate()

in both places?
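Concretely, the suggested pattern would look something like this (a sketch, assuming the job is launched from the project directory):

```julia
using Distributed, Pkg

# Activate and instantiate on the master process first...
Pkg.activate(".")
Pkg.instantiate()

# ...and then on every worker, so each remote process resolves the same
# project environment before any package is loaded
@everywhere begin
    using Pkg
    Pkg.activate(".")
    Pkg.instantiate()
end
```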

I’m not sure that it is related to your issue, but I found that I have to set the environment variable JULIA_PROJECT to path/to/environment for things to work properly with multiple workers on the SLURM cluster I’m using.
This should make the Pkg.activate(".") call redundant.

I have tried that; no change in the output, unfortunately.

Interesting, how do you set that environment variable? Should I do export JULIA_PROJECT=path/to/env on the login node, or something else?

I’m doing it as part of the sbatch script, it should be set on the compute node. But I’m not using machine files so maybe it’s different?
I just start the processes from within the Julia script with

using ClusterManagers
addprocs(SlurmManager(nworkers))

There is also the --project=path/to/environment flag that you can try using instead; I don’t remember if I tried that on my cluster. (There are some older discussions here about workers and environment activation.)
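A slightly fuller sketch: addprocs accepts an exeflags keyword whose contents are passed to each worker’s julia invocation, which is another way to propagate the project. Plain local workers stand in here for SlurmManager from ClusterManagers; the keyword is handled by Distributed either way:

```julia
using Distributed

# Propagate the master's active project to each worker via --project.
# With ClusterManagers you would call
#   addprocs(SlurmManager(nworkers); exeflags = ...)
# instead; local workers are used here as a stand-in.
project = something(Base.active_project(), "@.")
addprocs(2; exeflags = "--project=$project")
@assert nworkers() == 2

rmprocs(workers())  # clean up the demo workers
```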

I’m by no means a Slurm expert, but why do you need the machine file in your script? Wouldn’t the process start by default on the compute node when you send this script to the queue via sbatch?

As I understand it, the machine file tells Julia to start a worker process on each listed host of the allocation. In any case, this sbatch script results in the same errors as before:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=28
#SBATCH --mem-per-cpu=4571
#SBATCH --time=24:00:00
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR

# Create the machine file for Julia
JULIA_MACHINEFILE=machinefile-${SLURM_JOB_ID}
srun bash -c hostname > $JULIA_MACHINEFILE
sed -i 's/^[[:alnum:]]*/&-ib/g' $JULIA_MACHINEFILE

export JULIA_PROJECT=/home/dcs/csrxgb/julia_stuff/Project.toml

module purge
module load Julia/1.4.1-linux-x86_64

julia \
    --machine-file ${JULIA_MACHINEFILE} \
    src/creditcard/run.jl \
    --dataset uci_heart \
    --label target \
    --epsilon 6.0 \
    --folds 5 \
    --distributed

I will now try with your suggestion to use ClusterManagers

There is something different at least:

[login1 julia_stuff]$ cat slurm.cnode43.250472.err
ERROR: LoadError: LoadError: UndefVarError: @spawnat not defined
Stacktrace:
 [1] top-level scope
 [2] include(::Module, ::String) at ./Base.jl:377
 [3] exec_options(::Base.JLOptions) at ./client.jl:288
 [4] _start() at ./client.jl:484
in expression starting at /gpfs/home/dcs/csrxgb/julia_stuff/src/creditcard/run.jl:5
in expression starting at /gpfs/home/dcs/csrxgb/julia_stuff/src/creditcard/run.jl:4
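(For context: @spawnat is exported by Distributed, so this error typically means Distributed was not loaded before the macro was used. A minimal sketch of the required ordering:)

```julia
using Distributed   # must come before any use of @spawnat / @everywhere

addprocs(1)         # start one local worker process

# run myid() on worker 2 and fetch the result back on the master
fut = @spawnat 2 myid()
@assert fetch(fut) == 2
```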

Never mind, that was me not loading Distributed early enough; the code works now! Amazing. The only problem I have now is tracking progress through my distributed loop. I can see output from each process in a jobXXXX.out file, but I wrapped my main loop like so:

if distributed

    println("Distributing work...")
    p = Progress(total_steps)

    progress_pmap(1:total_steps, progress=p) do i

        CODE

    end
end

However, I cannot see this progress bar in the error output or in the job output. Any ideas?

Where is progress_pmap from? I use ProgressMeter.jl for progress reporting, but I actually only use it when running stuff interactively. Maybe you can try that package?

progress_pmap is from ProgressMeter.jl as well; I have tried both and neither seems to work. The strange thing is that I don’t seem to get any output or errors in the files defined in my sbatch script until the programme terminates / crashes: it is running successfully right now, but none of the printlns during my experiment setup have been recorded in those files either.

Maybe you need to call flush somewhere to have the output written to the file? After the job ends, you do see all the output from the println statements and so on?

Yep, I do. I have a main function that calls stuff like load_data, sets some constants, splits the data up etc., with some printlns in between, so I am wondering: if these don’t show up, maybe neither will any loading bars?

Maybe try explicitly printing to a file from within Julia and calling flush(file_io)? I think you can also redirect the output of the progress bar when you create it with Progress, but I’m not really sure.
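A minimal sketch of the explicit-file approach (the file name progress.log is made up; ProgressMeter’s Progress also takes an output IO keyword if you want to redirect the bar itself):

```julia
# Hypothetical log file; flushed after every write so the progress
# lines are visible on disk while the job is still running
open("progress.log", "w") do io
    for i in 1:5
        println(io, "finished step $i")
        flush(io)   # push buffered output to disk immediately
    end
end
```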

Ok will give these things a go, thank you very much for the help!
