I am using a package written in Julia, a language which is very new to me. I am running Julia v1.7.2 on an HPC cluster. I notice a strange error that appears randomly when I submit multiple jobs with different parameters:
ERROR: LoadError: Failed to precompile HDF5 [f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f] to /home/a/as42/.julia/compiled/v1.7/HDF5/jl_d6InQ8.
I get this error for multiple packages, even though in the log file I see:
I understand this is caused by multiple jobs trying to access the same precompiled cache file. I am not sure how to fix this issue.
If you run the following do you get a more extensive error message?
My suspicion is that the LD_PRELOAD or LD_LIBRARY_PATH environment variables are set, causing Julia to load a libhdf5.so that it is not compatible with.
Thank you for your reply. I did not get any error message when I ran the commands you suggested.
I have now reinstalled everything again to see if it helps.
However, I am still getting the following warning messages:
┌ Warning: Module HDF5 with build ID 18931272587474617 is missing from the cache.
│ This may mean HDF5 [f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:1325
┌ Warning: The call to compilecache failed to create a usable precompiled cache file for ITensors [9136182c-28ba-11e9-034c-db9fb085ebd5]
│ exception = Required dependency Zlib_jll [83775a58-1f1d-513f-b197-d71354ab007a] failed to load from a cache file.
└ @ Base loading.jl:1349
I believe this sometimes leads to the job being killed. I don’t understand how to fix this.
I think something got borked in your precompile cache. Delete the
/home/a/as42/.julia/compiled/v1.7 directory and try again.
If the other suggestions don’t work, you may need to load the environment module that provides HDF5 on Linux. I suggest searching for it with:
module spider hdf5
Can you elaborate on how to implement this?
I am still getting these errors on Julia 1.8.4.
I am submitting multiple jobs at once on the HPC cluster, and I randomly get this precompile error specific to the HDF5 package.
┌ Warning: The call to compilecache failed to create a usable precompiled cache file for HDF5_jll [0234f1f7-429e-5d53-9886-15a909be8d59]
│ exception = Required dependency Zlib_jll [83775a58-1f1d-513f-b197-d71354ab007a] failed to load from a cache file.
└ @ Base loading.jl:1349
ERROR: LoadError: could not load symbol "H5open":
/home/as42/julia-1.8.4/bin/julia: undefined symbol: H5open
ERROR: LoadError: Failed to precompile HDF5 [f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f] to /home/a/as42/.julia/compiled/v1.8/HDF5/jl_NIxl0D.
I don’t know whether my suggestion will work, but I assume your cluster admins have probably compiled HDF5 and made it available as a module that you can optionally load. If you type
module spider hdf5
into your shell, you should see some options.
module spider allows you to search for a package. This command just checks whether your admins have made HDF5 available. On both clusters I have access to, there are multiple versions of HDF5 installed. Try to load the module with
module load HDF5/1.13.1
If this doesn’t work, type
module spider HDF5/1.13.1 (replace the name and version with the one on your system). This will tell you all the modules you need to load first. On my system, I need four other modules:
module load GCC/11.3.0
module load OpenMPI/4.1.4
module load intel-compilers/2022.1.0
module load impi/2021.6.0
module load HDF5/1.13.1
Once you have these, put these lines in your Slurm batch script before you run Julia, and all the workers should have access to the right modules. I think this should help.
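For concreteness, here is a minimal sketch of what such a batch script could look like. The module names and versions are the ones from my system above; the job name, time limit, and script name are placeholders you would replace with your own:

```shell
#!/bin/bash
#SBATCH --job-name=julia-job      # placeholder job name
#SBATCH --time=01:00:00           # placeholder time limit

# Load the HDF5 toolchain before starting Julia, so that every
# worker sees the same libraries. Substitute the modules that
# `module spider HDF5` reports on your cluster.
module load GCC/11.3.0
module load OpenMPI/4.1.4
module load HDF5/1.13.1

julia code.jl
```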
Thank you for the nice explanation. I am not very familiar with the module functionality.
When I type “module avail” on cluster, I get the following:
--------------------------------------------- /usr/share/modules/versions -------------------------------------------------
-------------------------------------------- /usr/share/modules/modulefiles ------------------------------------------------
dot module-git module-info modules null use.own
Have you tried
module spider hdf5?
Julia should use the BinaryBuilder-compiled version of libhdf5 via HDF5_jll. I’d be surprised if anything improved by trying to load some other HDF5 library.
@AS_92, to me this feels like you might have a precompilation race condition: some file is getting rewritten and that invalidates the cache for other packages. Just to clarify, if you execute the suggestion in Precompilation error - #2 by mkitti (sequentially in each environment you depend on) and then don’t touch your package environment(s), does that fix the problem?
If that doesn’t fix the problem, then the follow-up is: what packages are you using? If they’re doing some nasty pkg work that could explain it (and that would be a bug in the package(s) you’re using rather than Julia itself). One way to diagnose that kind of bad behavior might be to do the
precompile step as described above and then temporarily disable write permission recursively (all subdirectories and files) on your
~/.julia/compiled/v1.8 directory. Then you should get an immediate error if some package is going rogue.
Thank you for your reply. I am very new to Julia so I am not sure how to implement some of the things you have mentioned. It will be kind of you if you could help me with this.
I am not building a package in Julia; rather, I am just using a package written in it. So, if I understand correctly, my environment is Julia 1.8. When I submit my codes on the HPC cluster, I am not initiating precompilation; I am just loading packages with “using PKGNAME” in my code.jl.
I set the path for Julia in my bash file, so the packages already compiled in ~/.julia/compiled/v1.8 are linked to the code.
I agree that I am running into a race condition, since the error appears when the codes run on the same core (I am using multithreading, so it’s possible my nodes are competing), and hence it appears randomly.
Should I precompile the packages every time I run a code?
How do I disable the permission to write in the directory?
Normally everything should just work. However, there seems to be something funny going on in your particular combination of packages and environment.
The first thing to try is to enter package mode (type ] at the julia> prompt) and type precompile. After that finishes, hit backspace to go back to the julia> prompt and try running your code.
If that doesn’t fix it, it suggests something strange is happening. I’m proposing that you debug it by turning off write permissions on your compiled-package cache: something like
chmod -R -w ~/.julia/compiled/v1.8, assuming you’re running unix. (Untested so you may have to experiment.) Now if you run your code, you might get a different error. Paste that error into this thread and perhaps it will shed light on what’s going wrong.
I tried to run the code using the bash file after removing the permissions:
I got the following error:
ERROR: LoadError: SystemError: mktemp: Permission denied
 systemerror(p::Symbol, errno::Int32; extrainfo::Nothing)
   @ Base ./error.jl:176
 #systemerror#80
   @ ./error.jl:175 [inlined]
 systemerror
   @ ./error.jl:175 [inlined]
 mktemp(parent::String; cleanup::Bool)
   @ Base.Filesystem ./file.jl:623
 mktemp
   @ ./file.jl:620 [inlined]
 compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
   @ Base ./loading.jl:1669
 compilecache
   @ ./loading.jl:1651 [inlined]
 _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1337
 _require_prelocked(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1200
 macro expansion
   @ ./loading.jl:1180 [inlined]
 macro expansion
   @ ./lock.jl:223 [inlined]
 require(into::Module, mod::Symbol)
   @ Base ./loading.jl:1144
Following some other suggestions on the discourse, I also tried doing the following thing:
julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'
julia -t $OMP_NUM_THREADS $ARG > out.dat
but I got the error:
ERROR: LoadError: SystemError: mktemp: Permission denied
(with the same stack trace as before)
I tried running the codes again after precompiling and restoring write permissions on the compiled folder. I started a bunch of jobs using a bash script, and I notice some of them are getting killed.
The screenshot below shows the difference between the log files of two codes that were started at the same time. The job in the left screen (theta_500) finished successfully, while the one on the right was killed immediately; both started at the same time on the same node. Additionally, I notice one major difference in the compiled folder ".julia/compiled/v1.8/HDF5_jll/", which seems to be the main cause of the error.
I also notice that the left job (theta_500) prints a warning, but still finishes successfully.
The compiled files loaded by different jobs are different: when HDF5_jll is missing, the job is terminated immediately, but when it loads more files, as in theta_100 and theta_200, I get errors later in the code, arising from a package I am using.
I am not sure what is going on. One possible way to start diagnosing is to add just the packages that seem to fail frequently to your environment, and precompile them:
pkg> precompile HDF5
If that fails, what’s the error? You can get a more expansive error message with
julia> using HDF5
You may need to do this for each “proximal” cause of failure. Then perhaps you can precompile your larger package?
Precompiling HDF5 doesn’t seem to throw any error.
I believe it throws the error only when two different codes on the same node try to access the HDF5 cache at the same time.
Access shouldn’t cause that problem (emphasis on shouldn’t). Two nodes trying to write the same cache file at the same time might. That’s why I wonder if precompiling serially might fix things.
One (slow) way to precompile everything might be to clear your .julia/compiled/v1.x directory and then just start with julia> using MyBigPkg, without first using pkg> precompile. That should force all the dependencies of MyBigPkg to be precompiled one at a time. You’ll have to be patient, as this can take a long time, but then perhaps your parallel job can work?
This might not work, though, if something in MyBigPkg is doing Pkg-level operations. It probably shouldn’t be doing that, so you might complain to the author if that seems to be what’s happening.
mktemp failing though? Is more than one process trying to create the same temporary file?
Could you just run a single process first before trying to launch multiple?
Another option might be to give each process its own JULIA_DEPOT_PATH, as a last resort.
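A possible sketch of that last resort, assuming Slurm array jobs (the depot directory name and $SLURM_ARRAY_TASK_ID usage are illustrative; adapt them to however you launch your jobs):

```shell
# Give each job its own writable depot so concurrent jobs never
# write to the same precompile cache. Listing the shared depot
# second keeps installed packages readable, while compiled cache
# files are written to the first (per-job) entry.
DEPOT="$HOME/.julia-depots/task-${SLURM_ARRAY_TASK_ID}"
mkdir -p "$DEPOT"
export JULIA_DEPOT_PATH="$DEPOT:$HOME/.julia"

julia -t "$OMP_NUM_THREADS" code.jl > out.dat
```

The trade-off is disk usage: each job precompiles into its own directory, so the first run of every job pays the full precompilation cost.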
mktemp failing though?
Because I had specifically asked him to turn off write permissions to see when a particular write step was occurring. There were additional writes after all the files were nominally precompiled, which is weird. `compilecache` failed when `@everywhere using` from remote machines · Issue #48217 · JuliaLang/julia · GitHub seems similar. There’s a bug somewhere but we’re still hunting for it.