Compile errors on HPC

hendri54 · September 25, 2020, 2:09pm

I am running code via slurm on an HPC. Most of the time, the code compiles without issue. Sometimes, it throws errors such as:

[lhendri@longleaf-login2 CollegeStrat]$ cat ../log/hProdBounded.out
┌ Warning: The call to compilecache failed to create a usable precompiled cache file for Clustering [aaaa29a8-35af-508c-8bc3-b662a17a0fe5]
│   exception = Required dependency NearestNeighbors [b8a86587-4115-5ab1-83bc-aa920d37bbce] failed to load from a cache file.
└ @ Base loading.jl:1041
ERROR: LoadError: SystemError: opening file "/nas/longleaf/home/lhendri/.julia/compiled/v1.4/EconometricsLH/OoVDL_N14SI.ji": Permission denied
Stacktrace:
 [1] systemerror(::String, ::Int32; extrainfo::Nothing) at ./error.jl:168
 [2] #systemerror#50 at ./error.jl:167 [inlined]
 [3] systemerror at ./error.jl:167 [inlined]
 [4] open(::String; read::Bool, write::Nothing, create::Nothing, truncate::Nothing, append::Bool) at ./iostream.jl:254
 [5] open(::String, ::String) at ./iostream.jl:310
 [6] open(::Base.var"#692#694", ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:296
 [7] open at ./io.jl:296 [inlined]
 [8] compilecache(::Base.PkgId, ::String) at ./loading.jl:1264
 [9] _require(::Base.PkgId) at ./loading.jl:1029
 [10] require(::Base.PkgId) at ./loading.jl:927
 [11] require(::Module, ::Symbol) at ./loading.jl:922
 [12] include(::Module, ::String) at ./Base.jl:377
 [13] top-level scope at none:2
 [14] eval at ./boot.jl:331 [inlined]
 [15] eval(::Expr) at ./client.jl:449
 [16] top-level scope at ./none:3
in expression starting at /nas/longleaf/home/lhendri/.julia/packages/ModelParams/j9xXf/src/ModelParams.jl:20
ERROR: LoadError: Failed to precompile ModelParams [4089ccbe-b1dc-5f86-a141-4606b18b4241] to /nas/longleaf/home/lhendri/.julia/compiled/v1.4/ModelParams/tRb7n_N14SI.ji.
[...]

When I resubmit the same code again, it often runs without issue.
How can I avoid these kinds of compile errors?

PetrKryslUCSD · September 25, 2020, 3:40pm

The /nas/ prefix in the paths hints that this is a network-attached storage file system. These are tricky to write to reliably.

johnh · September 25, 2020, 3:46pm

There is a package with a fix for race conditions when running on several compute nodes that share a filesystem: PkgLock.

https://github.com/simonbyrne/PkgLock.jl

May I ask whether you are running on multiple compute servers?
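
To illustrate the general idea of serializing writers to a shared cache, here is a minimal shell sketch. It is not how PkgLock.jl is implemented; the helper name with_lock and the lock path are hypothetical. It relies on mkdir either creating the directory or failing, which should be checked on your particular network file system, and it leaves a stale lock behind if the process is killed while holding it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command while holding a directory-based lock.
# Usage: with_lock LOCKDIR COMMAND [ARGS...]
with_lock() {
    local lockdir=$1
    shift
    # mkdir succeeds for exactly one caller; everyone else waits and retries.
    until mkdir "$lockdir" 2>/dev/null; do
        sleep 1
    done
    # Run the command, remembering its exit status without aborting early.
    local status=0
    "$@" || status=$?
    # Release the lock so the next waiting process can proceed.
    rmdir "$lockdir"
    return "$status"
}
```

A job script could then call something like with_lock "$HOME/.julia/precompile.lock" julia --project -e 'using Pkg; Pkg.precompile()' before its main run, so that only one job precompiles at a time.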

hendri54 · September 25, 2020, 5:18pm

The jobs run on a single node, multiple cores.

You are saying this can create race conditions at compile time?

simonbyrne · September 25, 2020, 5:19pm

If you’re launching multiple jobs at once or using distributed parallelism (either via the Distributed stdlib or MPI), you might be hitting race conditions. Our solution has been to run:

julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'

before launching any jobs.
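
As a rough sketch of that workflow, the precompile step can be wrapped together with job submission, so that a single Julia process fills the cache before any job starts and the jobs themselves should only need to read it. The function name precompile_then_submit and the job script names are hypothetical, and sbatch is assumed as the submit command because the original post mentions slurm.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: precompile once, serially, then submit the jobs.
# Usage: precompile_then_submit JOB_SCRIPT [JOB_SCRIPT...]
precompile_then_submit() {
    # One Julia process instantiates and precompiles the project.
    # If this fails, stop so that no job starts against a broken cache.
    julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()' || return 1

    # Only now hand the job scripts to the scheduler (sbatch is an assumption).
    local job
    for job in "$@"; do
        sbatch "$job" || return 1
    done
}
```

Called from the project directory, for example as precompile_then_submit job1.sh job2.sh, this keeps the cache-writing step out of the jobs that run concurrently.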

antoine-levitt · September 25, 2020, 5:42pm

Is this with Julia 1.5.1? Shouldn’t https://github.com/JuliaLang/julia/pull/36416 help here?

hendri54 · September 25, 2020, 6:37pm

Thank you for those suggestions. Our HPC is still on 1.4 for now.

I will mark @simonbyrne’s answer as the solution while I wait for 1.5.2 to get installed.

johnh · September 26, 2020, 2:52am

I would be interested to hear whether this solves the problem.
I would also like to hear anything about the configuration of your HPC and the type of storage.
Someone in the future may have a problem with a similar storage setup.

hendri54 · September 26, 2020, 1:04pm

As for the configuration, all the details are at

https://its.unc.edu/research-computing/techdocs/getting-started-on-longleaf/#System%20Information

I will experiment and report back. Thanks again.

hendri54 · October 1, 2020, 2:46pm

Reporting back: So far the precompile trick has done the job. No more compile errors as the jobs are dispatched.

Thanks again.

Dr.Merkwuedigliebe · January 2, 2022, 5:04pm

I have one related question:
If one runs the precompile trick while existing runs are ongoing, does it affect the existing runs?

I have multiple branches (with variant behaviours) of the same module, and the simulations take hours to days to run. I would like to know how best to switch branches, precompile, run job arrays on the current branch, all without impacting the already running jobs.

Many thanks!

erny123 · March 11, 2024, 4:55pm

This seems to still be a problem.

I’m on Julia 1.10.2 and seeing the same cache race-condition issues described above.

Your solution is a bit ambiguous. How is this solving the race conditions?

After this, is it safe to run mpiexec -n 20 julia ./script.jl?

In script.jl, is it safe to do:

using Pkg
Pkg.activate(".")
using MPI
using HDF5
using PencilArrays
.
.
.
