Repeated precompilation on a cluster makes life difficult

I’m using Julia v1.9.0-beta3 on a SLURM cluster. On a login node, where versioninfo() reports

julia> versioninfo()
Julia Version 1.9.0-beta3
Commit 24204a73447 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7742 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 1 on 64 virtual cores

if I run

OMP_NUM_THREADS=1 julia --project --startup-file=no -e "import Pkg; Pkg.instantiate(); Pkg.precompile()"

this precompiles the dependencies in my project (110 overall, so it takes a while). After this, I log in to a compute node, where versioninfo() reports

julia> versioninfo()
Julia Version 1.9.0-beta3
Commit 24204a73447 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 40 × Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 40 virtual cores

and run

julia> import Pkg; Pkg.precompile()
Precompiling environment...
  110 dependencies successfully precompiled in 541 seconds. 8 already precompiled.

This seems to precompile dependencies again, which is quite time-consuming.

Once this is over, though, if I return to the login node and run

julia> import Pkg; Pkg.precompile()
Precompiling environment...
  110 dependencies successfully precompiled in 443 seconds. 8 already precompiled.

this restarts the precompilation process all over again! Why is everything being precompiled repeatedly, when nothing has changed in the dependencies or the source? The cycle continues if I try it on the compute node subsequently.

I have tried deleting the compile cache, which didn’t change anything. Ideally, Julia would need to precompile dependencies only once on each architecture.

Is the path of Julia the same between compute and login node? If not, that’d invalidate the cache. But having to always precompile the environment is definitely not how this is supposed to work. You can try setting JULIA_DEBUG=loading to try and understand why the cache is being invalidated.

Thanks for that suggestion! The path of the Julia executable is the same on both nodes, but I find a bunch of messages similar to the following when precompilation restarts:

┌ Debug: Rejecting cache file /scratch/user/.julia/compiled/v1.9/Combinatorics/AwRuT_oTZDM.ji for  [top-level] since pkgimage can't be loaded on this target
└ @ Base loading.jl:2641

I see the same issue on a freshly installed nightly as well.
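
To see at a glance why caches are being rejected, the JULIA_DEBUG=loading output can be filtered for the "Rejecting cache file" messages. This is only a rough sketch: summarize_rejections is a hypothetical helper name, and it assumes every rejection message has the "... since <reason>" shape of the one quoted above.

```shell
# Hypothetical helper: read a JULIA_DEBUG=loading log on stdin and count
# how often each rejection reason appears.
summarize_rejections() {
  grep 'Rejecting cache file' \
    | sed -E 's/.* since (.*)$/\1/' \
    | sort | uniq -c
}

# Sample input: the message quoted in this thread (box-drawing prefix omitted).
sample="Debug: Rejecting cache file /scratch/user/.julia/compiled/v1.9/Combinatorics/AwRuT_oTZDM.ji for  [top-level] since pkgimage can't be loaded on this target"
summary=$(printf '%s\n' "$sample" | summarize_rejections)
echo "$summary"

# On a real node the log would come from Julia itself, for example:
#   JULIA_DEBUG=loading julia --project -e 'import Pkg; Pkg.precompile()' 2>&1 | summarize_rejections
```

If a single reason dominates the summary, as "pkgimage can't be loaded on this target" does here, that points at the CPU target rather than at paths or file timestamps.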

On login node
CPU: 64 × AMD EPYC 7742 64-Core Processor

and on the compute node
CPU: 40 × Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz

An expert will be along in a minute to enlighten us.
Looks to me like the compiler targets are different.

I’ve received some help on the issue that I’ve created about this, and it does seem to be related to different compiler targets.

This is an interesting problem. In general in HPC you would build/precompile an optimised binary for the target architecture you intend to run it on.
In the old days we would architect a system where login/compile nodes have the same CPU families.
These days, with things changing so fast and with virtualised/containerised servers, you can’t rely on that rather naive assumption.

What is the relevant JULIA_CPU_TARGET for your Xeon CPU?
Googling, I saw a reference to archspec from the Spack folks (I heard Todd Gamblin talk at FOSDEM on Friday).
Off-topic question: does archspec help us choose JULIA_CPU_TARGET?

Unfortunately, I don’t know enough about CPU targets. How does one find this out?
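
One way to find this out, as far as I can tell, is the LLVM line of versioninfo(): it ends with the CPU name that LLVM detected for the host (znver2 on the login node and skylake-avx512 on the compute node in the outputs above). The sketch below pulls that name out of a pasted line; llvm_line and cpu_name are illustrative variable names, and treating the detected name as a usable JULIA_CPU_TARGET entry is an assumption.

```shell
# LLVM line copied from the compute node's versioninfo() output above.
llvm_line='  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)'

# Extract the CPU name that follows "ORCJIT, " inside the parentheses.
cpu_name=$(printf '%s\n' "$llvm_line" | sed -E 's/.*\(ORCJIT, ([^)]*)\).*/\1/')
echo "$cpu_name"

# On a live node the same name should be available directly:
#   julia -e 'println(Sys.CPU_NAME)'
```

Running this on each node type gives one name per CPU family, which is the information needed to compare targets across the cluster.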

Did it also happen on Julia 1.8? The Pluto package has/had it too. I guessed it was caused by one of the dependencies but never figured it out.

This does not happen on v1.8, but happens on v1.9 onwards. Perhaps this is related to native code caching?

yeah, something is funky with Julia 1.9 and above

Setting the environment variable with export JULIA_CPU_TARGET=generic before precompiling seems to resolve this issue, as suggested in the page Package Images · The Julia Language.

A better value for this specific issue might be export JULIA_CPU_TARGET="generic;skylake-avx512,clone_all;znver2,clone_all"

Yes, you compiled native code for an incompatible instruction set architecture (ISA), so when you try to load the cache, the file is rejected because it probably uses instructions that are illegal on your current CPU. Multi-versioning circumvents the issue by compiling the code for multiple ISAs.
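
For reference, the multi-versioned value suggested in this thread can be assembled from the list of CPU names rather than typed by hand. This is a minimal sketch, not an official recipe: the targets variable is illustrative, and it is my assumption that the same value should be exported on every node type (for example from a shell profile or the job script) so that all nodes agree on one cache.

```shell
# Start from a generic baseline that any x86_64 node can load.
targets="generic"

# Add one fully cloned variant per CPU family seen in this thread
# (skylake-avx512 on the compute node, znver2 on the login node).
for cpu in skylake-avx512 znver2; do
  targets="${targets};${cpu},clone_all"
done

export JULIA_CPU_TARGET="$targets"
echo "$JULIA_CPU_TARGET"

# Then precompile as before, for example:
#   julia --project --startup-file=no -e 'import Pkg; Pkg.instantiate(); Pkg.precompile()'
```

The trade-off is that each package image then carries code for several targets, so a single precompilation presumably takes longer and the cache files grow, in exchange for not repeating it on every node switch.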
