Precompilation error

Hi!
I am using a package written in Julia, a language that is very new to me. I am running Julia v1.7.2 on an HPC cluster. I see a strange error that appears randomly when I submit multiple jobs with different parameters.

ERROR: LoadError: Failed to precompile HDF5 [f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f] to /home/a/as42/.julia/compiled/v1.7/HDF5/jl_d6InQ8.

I get this error for multiple packages, even though in the log file I see:

.julia/compiled/v1.7/HDF5/

My understanding is that this is caused by multiple codes trying to access the same compiled cache file at once. I am not sure how to fix this issue.

If you run the following do you get a more extensive error message?

using Pkg
Pkg.precompile()

My suspicion is that LD_PRELOAD or LD_LIBRARY_PATH environment variables may be set causing Julia to load a libhdf5.so that it is not compatible with.
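One way to check that suspicion from the shell is sketched below. The helper name report_loader_env is made up for illustration; the two variables are the ones named above, and the commented-out last line assumes julia is on your PATH.

```shell
# Hypothetical helper: print the two dynamic-loader variables mentioned above.
# "<unset>" is shown when a variable is unset or empty.
report_loader_env() {
    echo "LD_PRELOAD=${LD_PRELOAD:-<unset>}"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
}

report_loader_env

# To try Julia once with both variables cleared (assumes `julia` is on PATH):
#   env -u LD_PRELOAD -u LD_LIBRARY_PATH julia -e 'using HDF5'
```

If either variable points at a directory containing a system libhdf5.so, that would be worth mentioning in your reply.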

Hi!
Thank you for your reply. I did not get any error message when I ran the commands you suggested.
I have now reinstalled everything to see if it helps.
However, I am still getting the following warning messages:

┌ Warning: Module HDF5 with build ID 18931272587474617 is missing from the cache.
│ This may mean HDF5 [f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:1325
┌ Warning: The call to compilecache failed to create a usable precompiled cache file for ITensors [9136182c-28ba-11e9-034c-db9fb085ebd5]
│   exception = Required dependency Zlib_jll [83775a58-1f1d-513f-b197-d71354ab007a] failed to load from a cache file.
└ @ Base loading.jl:1349
I believe this sometimes leads to killing the job. I don’t understand how to fix this.

I think something got borked in your precompile cache. Delete the /home/a/as42/.julia/compiled/v1.7 directory and try again.

If the other suggestions don’t work, you may need to load the environment module that provides HDF5 support on Linux. I suggest searching for it like this:

module spider hdf5

Can you elaborate on how to implement this?
I am still getting these errors on Julia 1.8.4.
I am submitting multiple jobs at once on the HPC cluster, and some of them randomly fail with this precompile error, which is specific to the HDF5 package.

┌ Warning: The call to compilecache failed to create a usable precompiled cache file for HDF5_jll [0234f1f7-429e-5d53-9886-15a909be8d59]
│   exception = Required dependency Zlib_jll [83775a58-1f1d-513f-b197-d71354ab007a] failed to load from a cache file.
└ @ Base loading.jl:1349
ERROR: LoadError: could not load symbol "H5open":
/home/as42/julia-1.8.4/bin/julia: undefined symbol: H5open

ERROR: LoadError: Failed to precompile HDF5 [f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f] to /home/a/as42/.julia/compiled/v1.8/HDF5/jl_NIxl0D.

I don’t know whether my suggestion will work, but I assume that your cluster admins have probably compiled HDF5 and made it available as a module that you can optionally load. If you type

module avail

into your shell, you should see some options. module spider allows you to search for a package. This command just checks whether your admins have made HDF5 available. On both clusters I have access to, there are multiple versions of HDF5 installed. Try to load the module with

module load HDF5/1.13.1

If this doesn’t work, type module spider HDF5/1.13.1 (replace the name with the one/version on your system). This will tell you all the modules needed to load. On my system, I need 4 other packages:

module load GCC/11.3.0
module load OpenMPI/4.1.4 
module load intel-compilers/2022.1.0 
module load impi/2021.6.0
module load HDF5/1.13.1

Once you have these, put them in your Slurm batch script before the line that runs Julia, and all the workers should have access to the right modules. I think this should help.
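As a rough sketch, such a batch script might look like the one written out below. The module names and versions are copied from the list above and will differ on your cluster; the #SBATCH values and the file names julia_job.sh, code.jl and out.dat are placeholders, not something your system requires.

```shell
# Write a placeholder batch script; submit it afterwards with: sbatch julia_job.sh
cat > julia_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=julia-hdf5
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Load the modules first so every process in the job sees them.
# Versions are from the post above; use what `module spider` reports on your system.
module load GCC/11.3.0
module load OpenMPI/4.1.4
module load intel-compilers/2022.1.0
module load impi/2021.6.0
module load HDF5/1.13.1

# Run Julia only after the modules are loaded.
julia -t "$SLURM_CPUS_PER_TASK" code.jl > out.dat
EOF
```

The only point of the sketch is the ordering: every module load line comes before the julia line.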

Thank you for the nice explanation. I am not very familiar with the module system.
When I type “module avail” on the cluster, I get the following:

--------------------------------------------- /usr/share/modules/versions -------------------------------------------------
3.2.10

-------------------------------------------- /usr/share/modules/modulefiles ------------------------------------------------
dot         module-git  module-info modules     null        use.own
e

Have you tried module spider hdf5?

Julia should use the BinaryBuilder-compiled version of libhdf5 via HDF5_jll. I’d be surprised if anything would improve by trying to load some other hdf5 library.

@AS_92, to me this feels like you might have a precompilation race condition: some file is getting rewritten and that invalidates the cache for other packages. Just to clarify, if you execute the suggestion in Precompilation error - #2 by mkitti (sequentially in each environment you depend on) and then don’t touch your package environment(s), does that fix the problem?

If that doesn’t fix the problem, then the follow-up is: what packages are you using? If they’re doing some nasty pkg work that could explain it (and that would be a bug in the package(s) you’re using rather than Julia itself). One way to diagnose that kind of bad behavior might be to do the precompile step as described above and then temporarily disable write permission recursively (all subdirectories and files) on your ~/.julia/compiled/v1.8 directory. Then you should get an immediate error if some package is going rogue.

Thank you for your reply. I am very new to Julia, so I am not sure how to implement some of the things you have mentioned. It would be kind of you to help me with this.

I am not building a package in Julia; I am just using a package written in it. So, if I understand this correctly, my environment is Julia 1.8. When I submit my codes on the HPC cluster, I am not initiating precompilation. I am just loading the packages with “using PKGNAME” in my code.jl.
I set the path for Julia in my bash file so that the packages already compiled in ~/.julia/compiled/v1.8 are linked to the code.

I agree that I am running into a race condition: the error appears when the code is running on the same core (I am using multithreading, so it’s possible my nodes are competing), which is why it appears randomly.

Should I precompile the packages every time I run a code?
How do I disable the permission to write in the directory?

Normally everything should just work. However, there seems to be something funny going on in your particular combination of packages and environment.

The first thing to try is to enter package mode (type ] at the julia> prompt) and type precompile. After that finishes, hit backspace to go back to julia> mode and try running your code.

If that doesn’t fix it, it suggests something strange is happening. I’m proposing that you debug it by turning off write permissions on your compiled-package cache: something like chmod -R -w ~/.julia/compiled/v1.8, assuming you’re running unix. (Untested so you may have to experiment.) Now if you run your code, you might get a different error. Paste that error into this thread and perhaps it will shed light on what’s going wrong.
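The permission toggle described above could be wrapped in two small helpers, sketched here. The names lock_cache and unlock_cache are made up for illustration, and a-w / u+w are just an explicit spelling of the chmod idea; the directory is the one from this thread.

```shell
# Hypothetical helpers for the diagnostic above.
# lock_cache removes write permission for everyone, recursively;
# unlock_cache gives write permission back to the owner.
lock_cache()   { chmod -R a-w "$1"; }
unlock_cache() { chmod -R u+w "$1"; }

# Example (not run here):
#   lock_cache   ~/.julia/compiled/v1.8   # then run your job and note the error
#   unlock_cache ~/.julia/compiled/v1.8   # restore before precompiling again
```

Remember to unlock the directory afterwards, otherwise later precompilation will fail with the same permission error.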

I ran the code from the bash file after removing the write permissions and got the following error:

ERROR: LoadError: SystemError: mktemp: Permission denied
Stacktrace:
  [1] systemerror(p::Symbol, errno::Int32; extrainfo::Nothing)
    @ Base ./error.jl:176
  [2] #systemerror#80
    @ ./error.jl:175 [inlined]
  [3] systemerror
    @ ./error.jl:175 [inlined]
  [4] mktemp(parent::String; cleanup::Bool)
    @ Base.Filesystem ./file.jl:623
  [5] mktemp
    @ ./file.jl:620 [inlined]
  [6] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
    @ Base ./loading.jl:1669
  [7] compilecache
    @ ./loading.jl:1651 [inlined]
  [8] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:1337
  [9] _require_prelocked(uuidkey::Base.PkgId)
    @ Base ./loading.jl:1200
 [10] macro expansion
    @ ./loading.jl:1180 [inlined]
 [11] macro expansion
    @ ./lock.jl:223 [inlined]
 [12] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1144

Following some other suggestions on Discourse, I also tried the following:

julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'
julia -t $OMP_NUM_THREADS $ARG > out.dat

but I got the error:

ERROR: LoadError: SystemError: mktemp: Permission denied
Stacktrace:
  [1] systemerror(p::Symbol, errno::Int32; extrainfo::Nothing)
    @ Base ./error.jl:176
  [2] #systemerror#80
    @ ./error.jl:175 [inlined]
  [3] systemerror
    @ ./error.jl:175 [inlined]
  [4] mktemp(parent::String; cleanup::Bool)
    @ Base.Filesystem ./file.jl:623
  [5] mktemp
    @ ./file.jl:620 [inlined]
  [6] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
    @ Base ./loading.jl:1669
  [7] compilecache
    @ ./loading.jl:1651 [inlined]
  [8] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:1337
  [9] _require_prelocked(uuidkey::Base.PkgId)
    @ Base ./loading.jl:1200
 [10] macro expansion
    @ ./loading.jl:1180 [inlined]
 [11] macro expansion
    @ ./lock.jl:223 [inlined]
 [12] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1144

I ran the codes again after precompiling and restoring write permissions on the compiled folder. I started a batch of jobs using a bash script, and I notice that some of them get killed.
The screenshot below shows the difference between the log files of two codes that were started at the same time. The job on the left (theta_500) finished successfully, while the one on the right was killed immediately. Both started at the same time on the same node. I also notice one major difference in the compiled folder ".julia/compiled/v1.8/HDF5_jll/", which seems to be the main cause of the error.
I also notice that the left job (theta_500) throws a warning, but it still finishes successfully.

I notice that the compiled folders loaded by different jobs differ. When HDF5_jll is missing, the job is terminated immediately, but when it loads more files, as in theta_100 and theta_200, I get errors later in the code, arising from a package I am using.

I am not sure what is going on. One possible way to start diagnosing is to start by adding just the packages that seem to fail frequently to your environment, and precompile them:

pkg> precompile HDF5

If those fail, what’s the error? If it fails, you can get a more expansive error with

julia> using HDF5

You may need to do this for each “proximal” cause of failure. Then perhaps you can precompile your larger package?

Precompiling HDF5 doesn’t seem to throw any error.
I believe the error is thrown only when two different codes on the same node try to access the HDF5 cache at the same time.

Access shouldn’t cause that problem (emphasis on shouldn’t). Two nodes trying to write the same cache file at the same time might. That’s why I wonder if precompiling serially might fix things.

One (slow) way to precompile everything might be to clear your .julia/compiled/v1.x directory and then just start with julia> using MyBigPkg without first using pkg> precompile. That should force all the dependencies of MyBigPkg to be precompiled one-at-a-time. You’ll have to be patient, as this can take a long time, but then perhaps your parallel job can work?

This might not work, though, if something in MyBigPkg is doing Pkg-level operations. It probably shouldn’t be doing that, so you might complain to the author if that seems to be what’s happening.
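The "precompile serially first, then go parallel" idea above might be scripted roughly as follows. This is only a sketch under assumptions: warm_then_submit, job.sh and the THETA variable are hypothetical names, sbatch is assumed to be the submission command, and MyBigPkg is the placeholder from the post above. If your jobs run with --project, the warm-up should use the same project.

```shell
JULIA="${JULIA:-julia}"       # Julia executable (assumed to be on PATH)
SUBMIT="${SUBMIT:-sbatch}"    # submission command; sbatch is an assumption
PKG="${PKG:-MyBigPkg}"        # placeholder package name from the post above

# Hypothetical helper: load the package once in a single process so its
# dependencies are precompiled one at a time, then submit one job per argument.
warm_then_submit() {
    "$JULIA" -e "using $PKG" || return 1   # stop if the warm-up itself fails
    for theta in "$@"; do
        "$SUBMIT" --export=ALL,THETA="$theta" job.sh
    done
}

# Example (not run here): warm_then_submit 100 200 500
```

The point of the design is that no job is submitted until the single warm-up process has exited, so the parallel jobs should only need to read the cache.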

Why is mktemp failing though? Is more than one process trying to create the same temporary file?

Could you just run a single process first before trying to launch multiple?

Another option might be trying to give each process its own JULIA_DEPOT_PATH as a last resort.
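A minimal sketch of that last-resort option follows. It assumes Julia writes new cache files into the first entry of JULIA_DEPOT_PATH and still finds installed packages in later entries, as the DEPOT_PATH documentation describes. The helper name job_depot_path and the julia_depot_ prefix are made up; SLURM_JOB_ID is only set inside a Slurm job, so the shell PID is used as a fallback.

```shell
# Hypothetical helper: build a depot path whose first (writable) entry is
# private to one job, followed by the shared ~/.julia depot.
job_depot_path() {
    scratch="${1:-${TMPDIR:-/tmp}}"     # where the private depot lives
    job_id="${SLURM_JOB_ID:-$$}"        # Slurm job id, or PID outside Slurm
    printf '%s/julia_depot_%s:%s/.julia\n' "$scratch" "$job_id" "$HOME"
}

# Put these two lines before the julia call in each job script.
export JULIA_DEPOT_PATH="$(job_depot_path)"
mkdir -p "${JULIA_DEPOT_PATH%%:*}"      # create the private first entry
```

The cost is that each job may precompile everything for itself into its private depot, so startup gets slower.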

Why is mktemp failing though?

Because I had specifically asked him to turn off write permissions to see when a particular write step was occurring. There were additional writes after all the files were nominally precompiled, which is weird. `compilecache` failed when `@everywhere using` from remote machines · Issue #48217 · JuliaLang/julia · GitHub seems similar. There’s a bug somewhere but we’re still hunting for it.