Compile errors on HPC

I am running Julia code via Slurm on an HPC cluster. Most of the time, the code compiles without issue. Sometimes, it throws errors such as:

[lhendri@longleaf-login2 CollegeStrat]$ cat ../log/hProdBounded.out
┌ Warning: The call to compilecache failed to create a usable precompiled cache file for Clustering [aaaa29a8-35af-508c-8bc3-b662a17a0fe5]
│   exception = Required dependency NearestNeighbors [b8a86587-4115-5ab1-83bc-aa920d37bbce] failed to load from a cache file.
└ @ Base loading.jl:1041
ERROR: LoadError: SystemError: opening file "/nas/longleaf/home/lhendri/.julia/compiled/v1.4/EconometricsLH/OoVDL_N14SI.ji": Permission denied
Stacktrace:
 [1] systemerror(::String, ::Int32; extrainfo::Nothing) at ./error.jl:168
 [2] #systemerror#50 at ./error.jl:167 [inlined]
 [3] systemerror at ./error.jl:167 [inlined]
 [4] open(::String; read::Bool, write::Nothing, create::Nothing, truncate::Nothing, append::Bool) at ./iostream.jl:254
 [5] open(::String, ::String) at ./iostream.jl:310
 [6] open(::Base.var"#692#694", ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:296
 [7] open at ./io.jl:296 [inlined]
 [8] compilecache(::Base.PkgId, ::String) at ./loading.jl:1264
 [9] _require(::Base.PkgId) at ./loading.jl:1029
 [10] require(::Base.PkgId) at ./loading.jl:927
 [11] require(::Module, ::Symbol) at ./loading.jl:922
 [12] include(::Module, ::String) at ./Base.jl:377
 [13] top-level scope at none:2
 [14] eval at ./boot.jl:331 [inlined]
 [15] eval(::Expr) at ./client.jl:449
 [16] top-level scope at ./none:3
in expression starting at /nas/longleaf/home/lhendri/.julia/packages/ModelParams/j9xXf/src/ModelParams.jl:20
ERROR: LoadError: Failed to precompile ModelParams [4089ccbe-b1dc-5f86-a141-4606b18b4241] to /nas/longleaf/home/lhendri/.julia/compiled/v1.4/ModelParams/tRb7n_N14SI.ji.
[...]

When I resubmit the same code again, it often runs without issue.
How can I avoid these kinds of compile errors?

The /nas/ prefix in those paths hints that this is a network-attached-storage file system. These are tricky to write to reliably.

There was a package that fixed race conditions when running on several compute nodes sharing a filesystem: Pkglock.

May I ask whether you are running on multiple compute servers?

The jobs run on a single node, multiple cores.

Are you saying this can create race conditions at compile time?

If you’re launching multiple jobs at once or using distributed parallelism (either via the Distributed stdlib or MPI), you might be hitting race conditions. Our solution has been to run:

julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'

before launching any jobs.
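
For anyone who wants to script this, here is a minimal sketch of the same idea as a shell helper. It assumes the project lives in the current directory and that `julia` and `sbatch` are on the PATH; the function name `precompile_then_submit` is made up for illustration and is not part of Julia or Slurm.

```shell
# Sketch only: precompile once in a single Julia process, then submit.
# precompile_then_submit is a hypothetical helper name, not a Slurm or Julia command.
precompile_then_submit() {
    # Step 1: a single process instantiates and precompiles the project,
    # so the jobs started later find the cache files already written.
    julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()' || return 1

    # Step 2: only if step 1 succeeded, pass all arguments through to sbatch.
    sbatch "$@"
}
```

It could be called as `precompile_then_submit --array=1-20 run_model.sbatch`, where `run_model.sbatch` stands in for your own job script. If the precompile step fails, nothing is submitted.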

Is this with Julia 1.5.1? Shouldn’t https://github.com/JuliaLang/julia/pull/36416 help here?

Thank you for those suggestions. Our HPC is still on 1.4 for now.

I will mark @simonbyrne’s answer as the solution while I wait for 1.5.2 to get installed.

I would be interested to hear whether this solves the problem, and also anything about the configuration of your HPC and the type of storage.
Someone in the future may run into the same problem with a similar storage setup.

As for the configuration, all the details are at

I will experiment and report back. Thanks again.

Reporting back: So far the precompile trick has done the job. No more compile errors as the jobs are dispatched.

Thanks again.

I have one related question:
If one runs the precompile trick while existing runs are ongoing, does it affect those runs?

I have multiple branches (with variant behaviours) of the same module, and the simulations take hours to days to run. I would like to know how best to switch branches, precompile, run job arrays on the current branch, all without impacting the already running jobs.

Many thanks!