Code that works fine locally causes an error on a cluster

Hi. I have a function in my code that calculates the ROC AUC after an experiment. This has worked fine for me and continues to work fine locally, but on our cluster I get the following error:

ERROR: LoadError: On worker 2:
BoundsError: attempt to access 1-element Array{Int64,1} at index [2]
getindex at ./array.jl:788 [inlined]
AUC at /home/dcs/csrxgb/.julia/packages/MLJBase/O5b6j/src/measures/finite.jl:409
roc_auc at /gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/evaluation.jl:28
#evalu#43 at /gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/evaluation.jl:6
evalu at /gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/evaluation.jl:2
#93 at /gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/run.jl:231
#49 at /home/dcs/csrxgb/.julia/packages/ProgressMeter/g1lse/src/ProgressMeter.jl:795
#104 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:294
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:79
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:294 [inlined]
#103 at ./task.jl:358
Stacktrace:
 [1] (::Base.var"#726#728")(::Task) at ./asyncmap.jl:178
 [2] foreach(::Base.var"#726#728", ::Array{Any,1}) at ./abstractarray.jl:1919
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::UnitRange{Int64}) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Distributed.var"#204#207"{WorkerPool}, ::Function, ::UnitRange{Int64}) at ./asyncmap.jl:154
 [5] async_usemap(::Distributed.var"#188#190"{Distributed.var"#188#189#191"{WorkerPool,ProgressMeter.var"#49#52"{var"#93#95"{String,String,Int64,Bool,String,String,Int64,Float64,Float64,Float64,Float64,Float64,Array{Tuple{Float64,Float64},1},Int64,Bool,Int64},RemoteChannel{Channel{Bool}}}}}, ::UnitRange{Int64}; ntasks::Function, batch_size::Nothing) at ./asyncmap.jl:103
 [6] #asyncmap#710 at ./asyncmap.jl:81 [inlined]
 [7] pmap(::Function, ::WorkerPool, ::UnitRange{Int64}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Array{Any,1}, retry_check::Nothing) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/pmap.jl:126
 [8] pmap(::Function, ::WorkerPool, ::UnitRange{Int64}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/pmap.jl:101
 [9] pmap(::Function, ::UnitRange{Int64}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/pmap.jl:156
 [10] pmap(::Function, ::UnitRange{Int64}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/pmap.jl:156
 [11] macro expansion at /home/dcs/csrxgb/.julia/packages/ProgressMeter/g1lse/src/ProgressMeter.jl:794 [inlined]
 [12] macro expansion at ./task.jl:334 [inlined]
 [13] macro expansion at /home/dcs/csrxgb/.julia/packages/ProgressMeter/g1lse/src/ProgressMeter.jl:793 [inlined]
 [14] macro expansion at ./task.jl:334 [inlined]
 [15] progress_map(::Function, ::Vararg{Any,N} where N; mapfun::Function, progress::Progress, channel_bufflen::Int64, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/dcs/csrxgb/.julia/packages/ProgressMeter/g1lse/src/ProgressMeter.jl:786
 [16] #progress_pmap#53 at /home/dcs/csrxgb/.julia/packages/ProgressMeter/g1lse/src/ProgressMeter.jl:811 [inlined]
 [17] main() at /gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/run.jl:116
 [18] top-level scope at /gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/run.jl:389
 [19] include(::Module, ::String) at ./Base.jl:377
 [20] exec_options(::Base.JLOptions) at ./client.jl:288
 [21] _start() at ./client.jl:484
in expression starting at /gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/run.jl:389

Any idea as to what might be the cause of this?

I am not going to be of much help here. How are you distributing the model across the cluster?

I note the paths /home/dcs and /gpfs/home/dcs. I imagine you have a GPFS parallel filesystem, and that /home/dcs is a link to /gpfs/home/dcs. Is that correct?
It should not cause any difficulties, though.

Roughly:

using ClusterManagers
using Distributed
# addprocs(SlurmManager(parse(Int, ENV["SLURM_NTASKS"])), o=string(ENV["SLURM_JOB_ID"]))
addprocs_slurm(parse(Int, ENV["SLURM_NTASKS"]))
println("We are all connected and ready.")
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    println(host, pid)
end
using ArgParse
using ForwardDiff
...

# experiment setup
...

pmap(1:n_experiments) do i   # n_experiments stands in for the real range
    run_experiment(i)
    calc_auc(i)
end

And yes, I believe you are correct about the filesystem. My sbatch script looks like this; I then sbatch it to the cluster's queue from a login node:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=28
#SBATCH --mem-per-cpu=4571
#SBATCH --time=48:00:00
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR

# Create the machine file for Julia
# JULIA_MACHINEFILE=machinefile-${SLURM_JOB_ID}
# srun bash -c hostname > $JULIA_MACHINEFILE
# # srun -l /bin/hostname | sort -n | awk '{print $2}' > $JULIA_MACHINEFILE
# sed -i 's/^[[:alnum:]]*/&-ib/g' $JULIA_MACHINEFILE

export JULIA_PROJECT=/home/dcs/csrxgb/julia_stuff/Project.toml
export JULIA_CMDSTAN_HOME=/home/dcs/csrxgb/julia_stuff/cmdstan-2.23.0

module purge
module load GCC/8.2.0-2.31.1 GCCcore/8.2.0 Julia/1.4.1-linux-x86_64

# --machine-file ${JULIA_MACHINEFILE} \
julia src/logistic_regression/run.jl \
    --path /home/dcs/csrxgb/julia_stuff \
    --dataset uci_heart \
    --label target \
    --epsilon 6.0 \
    --iterations 10 \
    --folds 5 \
    --sampler AHMC \
    --no_shuffle \
    --distributed

I cannot solve the problem. I think you have to ask whether pmap() will run MLJ in parallel.
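One thing worth checking (an assumption on my part, not something the stack trace proves): with Distributed, every package and function used inside the pmap body must be loaded with @everywhere so it exists on all workers, not just the master process. A minimal sketch of the pattern, with illustrative names:

```julia
using Distributed
addprocs(2)

# Load dependencies on every worker; code inside pmap that
# references modules only loaded on the master fails remotely.
@everywhere using Statistics

# The work function must also be defined on all workers.
@everywhere compute(i) = mean([i, i + 1, i + 2])

println(pmap(compute, 1:4))
```

If your evaluation code calls into MLJ/MLJBase, the same applies: the `@everywhere using ...` lines have to run after addprocs_slurm.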

One small bit of advice: when you have code running locally, try running it on the cluster with only 1 process first. Then 2, then 4, etc.
KISS
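To make that concrete, here is a sketch of the scale-up approach (the function here is just a stand-in for your experiment):

```julia
using Distributed

# Step 1: a single worker. If this already fails, the problem
# is not the cluster scale but the distributed setup itself.
addprocs(1)
@everywhere square(x) = x^2
@assert pmap(square, 1:4) == [1, 4, 9, 16]

# Step 2: add workers and rerun the same job; keep doubling
# toward the full node count until the error reappears.
addprocs(1)   # now 2 workers; then 4, 8, ...
@assert pmap(square, 1:8) == [x^2 for x in 1:8]
println("pmap works with ", nworkers(), " workers")
```

The first worker count at which the job breaks usually tells you whether the bug is in the per-task code or in how the tasks are split up.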