Hi!
I am using a package written in Julia, a language that is very new to me. I am running Julia v1.7.2 on an HPC cluster, and I notice a very strange error that appears randomly when I submit multiple jobs with different parameters:
ERROR: LoadError: Failed to precompile HDF5 [f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f] to /home/a/as42/.julia/compiled/v1.7/HDF5/jl_d6InQ8.
I get this error for multiple packages, even though in the log file I see:
.julia/compiled/v1.7/HDF5/
I understand this is caused by multiple jobs trying to access the same precompiled cache at once, but I am not sure how to fix the issue.
Hi!
Thank you for your reply. I did not get any error message when I ran the commands you suggested.
I have now reinstalled everything again to see if it helps.
However, I am still getting the following warning messages:
┌ Warning: Module HDF5 with build ID 18931272587474617 is missing from the cache.
│ This may mean HDF5 [f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:1325
┌ Warning: The call to compilecache failed to create a usable precompiled cache file for ITensors [9136182c-28ba-11e9-034c-db9fb085ebd5]
│ exception = Required dependency Zlib_jll [83775a58-1f1d-513f-b197-d71354ab007a] failed to load from a cache file.
└ @ Base loading.jl:1349
I believe this sometimes leads to the job being killed, and I don’t understand how to fix it.
Can you elaborate on how to implement this?
I am still getting these errors on Julia 1.8.4.
I am submitting multiple jobs at once on the HPC cluster, and I randomly get this precompilation error, specific to the HDF5 package:
┌ Warning: The call to compilecache failed to create a usable precompiled cache file for HDF5_jll [0234f1f7-429e-5d53-9886-15a909be8d59]
│ exception = Required dependency Zlib_jll [83775a58-1f1d-513f-b197-d71354ab007a] failed to load from a cache file.
└ @ Base loading.jl:1349
ERROR: LoadError: could not load symbol "H5open":
/home/as42/julia-1.8.4/bin/julia: undefined symbol: H5open
ERROR: LoadError: Failed to precompile HDF5 [f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f] to /home/a/as42/.julia/compiled/v1.8/HDF5/jl_NIxl0D.
I don’t know whether my suggestion will work, but your cluster admins will probably have compiled HDF5 and made it available as a module that you can optionally load. If you type
module avail
into your shell, you should see some options; module spider lets you search for a particular package. This step just checks whether your admins have made HDF5 available. On both clusters I have access to, multiple versions of HDF5 are installed. Try to load the module with
module load HDF5/1.13.1
If this doesn’t work, type module spider HDF5/1.13.1 (replacing the name and version with the one on your system). This will tell you all the modules that need to be loaded first. On my system, I need to load four other modules before HDF5.
Once you have these, put the module load lines in your Slurm batch script before you run Julia, and all the workers should have access to the right modules. I think this should help.
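For concreteness, the top of such a batch script might look like the sketch below. The module names and versions are purely illustrative (the sort of list module spider HDF5 prints on my system); substitute whatever your cluster actually provides.

```shell
#!/bin/bash
#SBATCH --job-name=julia_hdf5
#SBATCH --nodes=1
#SBATCH --time=01:00:00

# Load HDF5 and its prerequisites before starting Julia.
# Names/versions below are placeholders -- run `module spider HDF5`
# on your own cluster to get the real list for your system.
module load GCCcore/11.3.0
module load zlib/1.2.12
module load Szip/2.1.1
module load HDF5/1.13.1

julia code.jl
```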
Julia should use the BinaryBuilder-compiled version of libhdf5 via HDF5_jll. I’d be surprised if trying to load some other HDF5 library improved anything.
@AS_92, to me this feels like you might have a precompilation race condition: some file is getting rewritten and that invalidates the cache for other packages. Just to clarify, if you execute the suggestion in Precompilation error - #2 by mkitti (sequentially in each environment you depend on) and then don’t touch your package environment(s), does that fix the problem?
If that doesn’t fix the problem, then the follow-up is: what packages are you using? If they’re doing some nasty pkg work that could explain it (and that would be a bug in the package(s) you’re using rather than Julia itself). One way to diagnose that kind of bad behavior might be to do the precompile step as described above and then temporarily disable write permission recursively (all subdirectories and files) on your ~/.julia/compiled/v1.8 directory. Then you should get an immediate error if some package is going rogue.
Thank you for your reply. I am very new to Julia so I am not sure how to implement some of the things you have mentioned. It will be kind of you if you could help me with this.
I am not building a package in Julia; rather, I am just using a package written in it. So, if I understand correctly, my environment is Julia 1.8. When I submit my codes on the HPC cluster, I am not initiating precompilation myself; I just load the packages with “using PKGNAME” in my code.jl.
I set the path for Julia in my bash file, so the packages already compiled in ~/.julia/compiled/v1.8 are linked to the code.
I agree that I am running into a race condition: the error appears when the codes are running on the same node (I am using multithreading, so it’s possible my jobs are competing), and hence it shows up randomly.
Should I precompile the packages every time I run a code?
How do I disable the permission to write in the directory?
Normally everything should just work. However, there seems to be something funny going on in your particular combination of packages and environment.
The first thing to try is to enter package mode (type ] at the julia> prompt) and type precompile. After that finishes, hit backspace to go back to julia> mode and try running your code.
If that doesn’t fix it, it suggests something strange is happening. I’m proposing that you debug it by turning off write permissions on your compiled-package cache: something like chmod -R -w ~/.julia/compiled/v1.8, assuming you’re running unix. (Untested so you may have to experiment.) Now if you run your code, you might get a different error. Paste that error into this thread and perhaps it will shed light on what’s going wrong.
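As a sanity check that the chmod incantation does what we want, here is a throwaway sketch against a scratch directory standing in for ~/.julia/compiled/v1.8 (the path and .ji file name are made up for the demo; stat -c assumes GNU coreutils, as on most Linux clusters):

```shell
# Build a stand-in for the compiled-package cache.
cache=$(mktemp -d)
mkdir -p "$cache/v1.8/HDF5"
touch "$cache/v1.8/HDF5/deadbeef.ji"
chmod 644 "$cache/v1.8/HDF5/deadbeef.ji"

# Strip write permission recursively, as suggested above.
chmod -R -w "$cache/v1.8"
mode=$(stat -c '%a' "$cache/v1.8/HDF5/deadbeef.ji")
echo "mode after chmod: $mode"   # 644 -> 444: nothing can silently rewrite the cache

# When you're done debugging, give yourself write access back:
chmod -R u+w "$cache/v1.8"
```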
I tried to run the codes again after precompiling and adjusting the write permissions on the compiled folder. I started a bunch of jobs using a bash script, and I notice some of them are getting killed.
The screenshot below shows the difference between the log files of two codes that were started at the same time. The job in the left screen (theta_500) finished successfully, while the one on the right was killed immediately; both started at the same time on the same node. Additionally, I notice one major difference in the compiled folder ".julia/compiled/v1.8/HDF5_jll/", which seems to be the main cause of the error.
I also notice that the left job (theta_500) throws a warning, but still finishes successfully.
I notice the compiled files imported by different jobs are different. When HDF5_jll is missing, the job is terminated immediately; but when more files are loaded, as in theta_100 and theta_200, I get errors later in the run coming from a package I am using.
I am not sure what is going on. One way to start diagnosing is to add just the packages that seem to fail frequently to your environment and precompile them:
pkg> precompile HDF5
If that fails, what’s the error? You can usually get a more informative error with
julia> using HDF5
You may need to do this for each “proximal” cause of failure. Then perhaps you can precompile your larger package?
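If it's easier to do this non-interactively (say, on a login node before submitting jobs), the same steps can be run from the shell. I believe recent Julia versions let Pkg.precompile take a package name, but check on your version:

```shell
# Run these serially, once, before launching the parallel jobs.
# Precompile just the package that keeps failing...
julia -e 'using Pkg; Pkg.precompile("HDF5")'

# ...and/or load it directly to surface the full error:
julia -e 'using HDF5'
```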
Precompiling HDF5 doesn’t seem to throw any error.
I believe the error is thrown only when two different codes on the same node try to access the HDF5 cache at the same time.
Access shouldn’t cause that problem (emphasis on shouldn’t). Two jobs trying to write the same cache file at the same time might, though. That’s why I wonder whether precompiling serially would fix things.
One (slow) way to precompile everything is to clear your .julia/compiled/v1.x directory and then just start with julia> using MyBigPkg, without first running pkg> precompile. That forces all the dependencies of MyBigPkg to be precompiled one at a time. You’ll have to be patient, as this can take a long time, but then perhaps your parallel jobs will work?
This might not work, though, if something in MyBigPkg is doing Pkg-level operations. It probably shouldn’t be doing that, so you might complain to the author if that seems to be what’s happening.
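Spelled out as shell commands, the clear-and-warm-up approach above might look like this (MyBigPkg is a placeholder for whatever package you actually use):

```shell
# Move the old cache aside rather than deleting it outright,
# in case you want to restore it (v1.8 here; adjust to your version).
mv ~/.julia/compiled/v1.8 ~/.julia/compiled/v1.8.bak

# Loading the package forces each dependency to precompile one at a
# time. Run this serially on a login node, and be patient.
julia -e 'using MyBigPkg'
```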