using Pkg; Pkg.activate(".")
using Distributed
using ArgParse
using ForwardDiff
using LinearAlgebra
using CSV
using DataFrames
using AdvancedHMC
...
@everywhere begin
using Pkg; Pkg.activate(".")
using Distributed
using ArgParse
using ForwardDiff
using LinearAlgebra
using CSV
using DataFrames
using AdvancedHMC
...
end
MAIN BODY OF CODE
However, when I submit my batch job, it always crashes in a way that suggests the environment is not being set properly:
ERROR: LoadError: On worker 2:
ArgumentError: Package ArgParse [c7e460c6-2fb9-53a9-8c5b-16f535851c63] is required but does not seem to be installed:
- Run `Pkg.instantiate()` to install all recorded dependencies.
_require at ./loading.jl:998
require at ./loading.jl:927
#1 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/Distributed.jl:78
#101 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:290
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:79
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:88
#94 at ./task.jl:358
...and 27 more exception(s).
Stacktrace:
[1] sync_end(::Array{Any,1}) at ./task.jl:316
[2] macro expansion at ./task.jl:335 [inlined]
[3] _require_callback(::Base.PkgId) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/Distributed.jl:75
[4] #invokelatest#1 at ./essentials.jl:712 [inlined]
[5] invokelatest at ./essentials.jl:711 [inlined]
[6] require(::Base.PkgId) at ./loading.jl:930
[7] require(::Module, ::Symbol) at ./loading.jl:922
[8] include(::Module, ::String) at ./Base.jl:377
[9] exec_options(::Base.JLOptions) at ./client.jl:288
[10] _start() at ./client.jl:484
I’d guess this is some problem with setting Julia paths etc. across the remote workers. I would appreciate some help with this.
I’m not sure that it is related to your issue, but I found that I have to set the environment variable JULIA_PROJECT to path/to/environment in order for things to properly work on the SLURM cluster I’m working on when using multiple workers.
This should make the Pkg.activate(".") call redundant.
I’m doing it as part of the sbatch script, it should be set on the compute node. But I’m not using machine files so maybe it’s different?
I just start the processes from within the Julia script with
using ClusterManagers
addprocs(SlurmManager(nworkers))
There is also the --project=path/to/environment flag that you can try using instead. I don’t remember if I tried that on my cluster (there are some older discussions here about workers and environment activation).
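For illustration, here is a minimal sketch of passing the project to workers that are started from inside the script. It assumes the master process already has the right environment active. The worker count of 2 and the use of plain local workers are placeholders, and how a given cluster manager forwards `exeflags` has not been checked here.

```julia
using Distributed

# Directory of the project that is active on the master process.
# Assumption: this is the environment the workers should share.
project_dir = dirname(Base.active_project())

# Start workers with the same project. The count (2) and the local workers
# are placeholders; on a cluster you would go through your cluster manager.
addprocs(2; exeflags="--project=$(project_dir)")

# Ask each worker which project file it actually picked up.
worker_projects = [remotecall_fetch(Base.active_project, w) for w in workers()]
for (w, p) in zip(workers(), worker_projects)
    println("worker ", w, " uses ", p)
end
```

If the printed paths differ from the master's project, the workers are not loading the environment you expect, which would match the "is required but does not seem to be installed" error above.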
I’m by no means a Slurm expert, but why do you need the machine file in your script? Wouldn’t the process start by default on the compute node when you send this script to the queue via sbatch?
As I understand it, the machine file creates more processes for each remote worker associated with the sbatch. This sbatch results in the same errors as before:
[login1 julia_stuff]$ cat slurm.cnode43.250472.err
ERROR: LoadError: LoadError: UndefVarError: @spawnat not defined
Stacktrace:
[1] top-level scope
[2] include(::Module, ::String) at ./Base.jl:377
[3] exec_options(::Base.JLOptions) at ./client.jl:288
[4] _start() at ./client.jl:484
in expression starting at /gpfs/home/dcs/csrxgb/julia_stuff/src/creditcard/run.jl:5
in expression starting at /gpfs/home/dcs/csrxgb/julia_stuff/src/creditcard/run.jl:4
Never mind, that was me not loading Distributed early enough; the code works now! Amazing. The only problem I have now is tracking progress through my distributed loop. I can see output from each process in a jobXXXX.out file, but I wrapped my main loop like so:
if distributed
println("Distributing work...")
p = Progress(total_steps)
progress_pmap(1:total_steps, progress=p) do i
CODE
However, I cannot see the progress bar in the error output or in the job output. Any ideas?
Where is progress_pmap from? I use ProgressMeter.jl for progress reporting, but I actually only use it when running stuff interactively. Maybe you can try that package?
progress_pmap is from ProgressMeter.jl as well. I have tried both and neither seems to work. The strange thing is that I don’t seem to get any output or errors in the files defined in my sbatch until the programme terminates or crashes: it is running successfully right now, but none of the printlns from the experiment setup have been recorded in those files either.
Maybe you need to use flush somewhere to have the output written to the file? After the job ends you do see all the output from the println statements and so on?
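A minimal sketch of the flush idea, using only Base. The helper name `log_now` is hypothetical and not part of any package; it simply prints a line and flushes the stream straight away so the text does not sit in a buffer until the job ends.

```julia
# Hypothetical helper: print a line and flush the stream immediately.
function log_now(io::IO, msg::AbstractString)
    println(io, msg)
    flush(io)   # force buffered output to be written out now
    return nothing
end

log_now(stdout, "Loading data...")
log_now(stdout, "Splitting data...")
```

Whether this is enough depends on how the cluster buffers the job's output files, so treat it as a first thing to try rather than a guaranteed fix.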
Yep, I do. I have a main function that calls stuff like load_data, sets some constants, splits the data up etc., with some printlns in between, so I am wondering: if these don’t show up, maybe the loading bars won’t either?
Maybe try explicitly printing to a file from within Julia and calling flush(file_io)? I think you can also redirect the output of the progress bar when you create it with Progress, but I’m not really sure.
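One possible shape for the "print to a file and flush" suggestion, sketched with the Distributed standard library only. This is not how ProgressMeter works internally; it is a stand-in technique in which workers report finished steps through a RemoteChannel and only the master writes to the log. The value of `total_steps`, the temporary `log_path`, and the `i^2` work item are all placeholders.

```julia
using Distributed

total_steps = 8                                   # placeholder value
log_path = joinpath(mktempdir(), "progress.log")  # hypothetical log location

# Workers report finished steps here; only the master touches the file.
done = RemoteChannel(() -> Channel{Int}(total_steps))

results = open(log_path, "w") do log_io
    # Background task on the master: one flushed log line per finished step.
    writer = @async for n in 1:total_steps
        i = take!(done)
        println(log_io, "finished step ", i, " (", n, "/", total_steps, ")")
        flush(log_io)   # make the line visible while the job is still running
    end
    out = pmap(1:total_steps) do i
        result = i^2    # stand-in for the real work
        put!(done, i)   # report completion to the master
        result
    end
    wait(writer)
    out
end
```

In a real job the log would go somewhere on the shared filesystem so it can be followed with `tail -f` from the login node.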