I am running Julia code via Slurm on an HPC cluster. Most of the time, the code precompiles without issue. Sometimes, it throws errors such as:
[lhendri@longleaf-login2 CollegeStrat]$ cat ../log/hProdBounded.out
┌ Warning: The call to compilecache failed to create a usable precompiled cache file for Clustering [aaaa29a8-35af-508c-8bc3-b662a17a0fe5]
│ exception = Required dependency NearestNeighbors [b8a86587-4115-5ab1-83bc-aa920d37bbce] failed to load from a cache file.
└ @ Base loading.jl:1041
ERROR: LoadError: SystemError: opening file "/nas/longleaf/home/lhendri/.julia/compiled/v1.4/EconometricsLH/OoVDL_N14SI.ji": Permission denied
Stacktrace:
[1] systemerror(::String, ::Int32; extrainfo::Nothing) at ./error.jl:168
[2] #systemerror#50 at ./error.jl:167 [inlined]
[3] systemerror at ./error.jl:167 [inlined]
[4] open(::String; read::Bool, write::Nothing, create::Nothing, truncate::Nothing, append::Bool) at ./iostream.jl:254
[5] open(::String, ::String) at ./iostream.jl:310
[6] open(::Base.var"#692#694", ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:296
[7] open at ./io.jl:296 [inlined]
[8] compilecache(::Base.PkgId, ::String) at ./loading.jl:1264
[9] _require(::Base.PkgId) at ./loading.jl:1029
[10] require(::Base.PkgId) at ./loading.jl:927
[11] require(::Module, ::Symbol) at ./loading.jl:922
[12] include(::Module, ::String) at ./Base.jl:377
[13] top-level scope at none:2
[14] eval at ./boot.jl:331 [inlined]
[15] eval(::Expr) at ./client.jl:449
[16] top-level scope at ./none:3
in expression starting at /nas/longleaf/home/lhendri/.julia/packages/ModelParams/j9xXf/src/ModelParams.jl:20
ERROR: LoadError: Failed to precompile ModelParams [4089ccbe-b1dc-5f86-a141-4606b18b4241] to /nas/longleaf/home/lhendri/.julia/compiled/v1.4/ModelParams/tRb7n_N14SI.ji.
[...]
When I resubmit the same code, it often runs without issue.
How can I avoid these kinds of compile errors?
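If you’re launching multiple jobs at once or using distributed parallelism (either via the Distributed stdlib or MPI), you might be hitting race conditions: several processes try to write the same precompile cache files at the same time. Our solution has been to run the following once, in a single process, before launching the jobs: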
julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'
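To make the ordering explicit, here is a minimal sketch of a wrapper that precompiles first and submits only if that step succeeds. The function name precompile_then_submit, the JULIA and SBATCH variables, and the job script name are hypothetical, chosen for illustration; only the julia command line itself comes from the suggestion above.

```shell
#!/usr/bin/env bash
# Sketch only: precompile in ONE process, then submit the Slurm job.
# JULIA and SBATCH are hypothetical override variables (not part of
# Julia or Slurm); they default to the usual executables.
JULIA="${JULIA:-julia}"
SBATCH="${SBATCH:-sbatch}"

# Hypothetical helper.
#   $1: project directory containing Project.toml
#   $2: Slurm batch script to submit
precompile_then_submit() {
    local project_dir="$1"
    local job_script="$2"

    # Step 1: a single process installs and precompiles all dependencies,
    # so the cache files already exist when the jobs start.
    "$JULIA" --project="$project_dir" \
        -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()' || return 1

    # Step 2: submit only after precompilation has succeeded.
    "$SBATCH" "$job_script"
}
```

It could be called as, for example, precompile_then_submit . job.sh from the project directory. If the precompile step fails, nothing is submitted.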
I would be interested to hear if this solves the problem.
It would also help to know the configuration of your HPC and the type of storage it uses.
Someone in the future may have a problem with a similar storage setup.
I have one related question:
If one runs the precompile trick while existing runs are ongoing, does it affect those runs?
I have multiple branches (with variant behaviours) of the same module, and the simulations take hours to days to run. I would like to know how best to switch branches, precompile, run job arrays on the current branch, all without impacting the already running jobs.