Julia 1.9, same depot with different machines?

Sevi · September 2, 2023, 8:40am

Hi! I’m using Julia 1.9 on an HPC cluster and ran into a strange error on some of the machines. I have a hunch that it’s related to native code caching and the fact that I currently use the same depot for all machines, but I’m a bit stuck investigating the actual issue and finding a good solution.

Problem

Running certain code on the affected machines gives an Illegal instruction error. So far I could only trigger the error when using Cthulhu.jl, but I’m not sure if it’s really related to that package itself (see below for the steps to reproduce and the machine specs).

When I first precompile on machine A and then run the example on machine B, Julia crashes with an Illegal instruction error.

Error details

julia> @descend sort([5,4,3])
Invalid instruction at 0x14f38d0600f2: 0xc5, 0xfc, 0x46, 0xc8, 0xc5, 0xf1, 0xef, 0xc9, 0xc5, 0xf9, 0x6f, 0x05, 0x2e, 0x3d, 0xec

[2951144] signal (4.2): Illegal instruction
in expression starting at REPL[2]:1
iterate at ./range.jl:887 [inlined]
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:110
abstract_call_known at ./compiler/abstractinterpretation.jl:1949
abstract_call at ./compiler/abstractinterpretation.jl:2020
abstract_call at ./compiler/abstractinterpretation.jl:1999
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2183
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2396
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:2682
typeinf_local at ./compiler/abstractinterpretation.jl:2867
typeinf_nocycle at ./compiler/abstractinterpretation.jl:2955
_typeinf at ./compiler/typeinfer.jl:246
typeinf at ./compiler/typeinfer.jl:219 [inlined]
...

When I do it in reverse (precompile on B and run on A), the example runs on both machines without issues. Of course, the problem might still show up elsewhere in a way I haven’t found so far.
If I do the same on Julia 1.8.5 (same version of Cthulhu), it works both ways, and there is a noticeable lag when running @descend sort([5,4,3]) for the first time on both machines, which seems to indicate that native code is compiled on each machine independently (instead of being precompiled on a single machine).

Steps to reproduce

I’m testing this on two different machines at the moment, which consistently reproduces the error. These are the specs:

Machine A

versioninfo()

Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 1 on 16 virtual cores
Environment:
...

Machine B

versioninfo()

Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 96 × AMD EPYC 7352 24-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 1 on 96 virtual cores
Environment:
...

To reproduce the error, I use an empty Julia depot. These are the steps I’m testing:

1. Start the REPL with an empty depot on one machine.
2. Run ] add Cthulhu to let it build the depot and precompile (the version is v2.9.0).
3. Run using Cthulhu; @descend sort([5,4,3]) (just an example).
4. Start Julia with the same (non-empty) depot on the other machine and run using Cthulhu; @descend sort([5,4,3]) again to potentially trigger the error.

Questions

It feels a lot like this is related to native code caching and that I’m not supposed to use the same depot for different machines with different processors. Is this generally true? In Julia 1.8 I had no problems with the current setup (using one depot for different machines). Using a separate depot for each type of machine would cause a lot of overhead.
I couldn’t reproduce the illegal instruction error with other code so far, so I’m wondering if this is related to Cthulhu specifically? I’m fine with not using Cthulhu, since I don’t really need it in that environment. I caught the issue more by accident, since it was still installed, but it made me a bit paranoid about more errors later during operation.

Any hints as to what is actually going on and what would be a good way to deal with the issue are greatly appreciated.

PS: The original illegal instruction was a different one and appeared while precompiling Cthulhu on machine B, but it also used an old depot with stuff already in it and I can’t reproduce the same opcode in a fresh depot.

Different illegal opcode from precompilation

Invalid instruction at 0x15487a71163b: 0x62, 0xd3, 0xfd, 0x08, 0x1f, 0x06, 0x04, 0xc5, 0xf9, 0x98, 0xc0, 0x75, 0x2a, 0x48, 0x8b

giordano · September 2, 2023, 9:58am

You can generate native code for multiple targets by setting the environment variable JULIA_CPU_TARGET appropriately.

ffevotte · September 2, 2023, 10:42am

Another part of the documentation, which nicely complements the page linked above:

https://docs.julialang.org/en/v1/devdocs/sysimg/#sysimg-multi-versioning

Sevi · September 2, 2023, 12:26pm

Thanks @giordano and @ffevotte for pointing me in the right direction!

Running the same with JULIA_CPU_TARGET=generic the first time the native code is compiled fixed the issue.

The part of the docs I missed, which explains what is going on in 1.9, is here: Package Images · The Julia Language

Specifically, it mentions

Package images optimized for multiple microarchitectures

Similar to multi-versioning for system images, package images support multi-versioning. If you are in a heterogenous environment, with a unified cache, you can set the environment variable JULIA_CPU_TARGET=generic to multi-version the object caches.

If I understand this correctly now, the environment variable JULIA_CPU_TARGET, which defaults to native, was only relevant for generating sysimages before 1.9. Since the default behavior in 1.9+ is to also generate package images (cache native code), it can happen that a package image is compiled too restrictively for one specific CPU. Another machine that uses the same cache then runs into illegal instructions.

I still have a question though:
Since the default sysimage that is shipped with Julia is compiled for generic targets, wouldn’t it make sense to also default to generic for the package image compilation (as far as I can tell, the same variable controls both sysimage and package image, so maybe it’s not so easy to use different defaults in both cases)? Or does the performance gain from defaulting to native always outweigh the (small) chance of such issues? I would assume precompilation for generic targets would take longer than just for the native architecture?

I guess the situation I encountered above is not particularly common, but it definitely feels like a pitfall when upgrading from 1.8 to 1.9 (some code does not work the same way it did before the upgrade).

giordano · September 2, 2023, 6:03pm

This “works”, but it’ll also generate very bad code for your CPUs, especially when vectorisation instructions could be used. Instead, you want to generate code for both cascadelake and znver2; please refer to the documentation pages shared above.

Sevi · September 2, 2023, 8:37pm

I understand, but I chose generic as a “quick fix” since I don’t yet have a list of all architectures of the cluster machines, and there are more than the two in the example. Once I have that list, or when restricting to certain nodes, I can of course just target them directly.

So far, the performance doesn’t seem dramatically different. Are there any benchmarks that show how big the speedup between generically compiled package images and native ones actually is on a real-world example?

Sevi · September 3, 2023, 12:24pm

Just a quick update: Specifying multiple CPU targets doesn’t seem to affect the package image compilation. I’ve tried specifying the targets of both test machines (znver2 is the architecture of the offending machine B).

This works

$ JULIA_CPU_TARGET="znver2;cascadelake" julia

but this doesn’t

$ JULIA_CPU_TARGET="cascadelake;znver2" julia

which suggests that only the first entry matters for the precompilation. If I try the command-line option which is mentioned in one of the linked parts of the documentation, it works as well

$ julia -C generic # works
$ julia -C znver2 # works
$ julia -C cascadelake # doesn't work

but it is not possible to specify more than one target with -C.

@giordano The part of the docs I quoted explicitly states that JULIA_CPU_TARGET=generic should be used for package image compilation when using a shared cache. Am I still missing something?

giordano · September 3, 2023, 12:35pm

Sevi:

This works
$ JULIA_CPU_TARGET="znver2;cascadelake" julia
but this doesn’t
$ JULIA_CPU_TARGET="cascadelake;znver2" julia
which suggests that only the first entry matters for the precompilation.

Please read the “Note” box in System Image Building · The Julia Language

Sevi · September 3, 2023, 12:59pm

Using JULIA_CPU_TARGET="cascadelake;znver2,clone_all" doesn’t work either.

Please note that this question was never about system image building, but about package image compilation. I’m not trying to split hairs here, I would just like to understand what is going on.

I’ve read the parts of the docs you mentioned, and the page about system image building does mention that generating package images is “similar”, but doesn’t go into much detail. The only information I can find in the docs specifically about my issue is to use generic…

giordano · September 3, 2023, 1:50pm

The syntax is the same, that’s why that page had been linked above by @ffevotte.

jishnub · September 3, 2023, 3:14pm

In the past, I have used export JULIA_CPU_TARGET="generic;skylake-avx512,clone_all;znver2,clone_all" on a heterogeneous cluster to resolve this issue, so perhaps you need something similar with the appropriate targets?

Sevi · September 3, 2023, 9:43pm

At least for me that’s not obvious from the docs. Nor that they do the same thing internally, even if they have the same syntax.

Thanks for the suggestion! Yes, that works. For me, any combination of CPU targets that starts with generic;... or znver2;... works. Is there a simple way to check whether anything different actually happens when only specifying generic vs. adding more targets like generic;znver2;... ? So far I couldn’t come up with a good test for that.

Or, put differently, why doesn’t it work to use code that was compiled on a cascadelake machine with JULIA_CPU_TARGET=cascadelake;znver2 on a znver2 machine? (For the specific example above, it gives me the illegal instruction.) If I run with the same option cascadelake;znver2 to precompile on the znver2 machine, it works on both. Putting clone_all anywhere doesn’t change the behavior.

Just to see if the behavior is the same for system images, I tried to run similar tests with creating sysimages, but that left me even more confused I think.

Sysimage compilation

I’ve compiled on both machines (every time with a fresh depot) with different combinations of cpu_targets using PackageCompiler.jl. The example workload is again descend from Cthulhu.jl. The Julia command is started without JULIA_CPU_TARGET.

create_sysimage(["Cthulhu"]; sysimage_path="JuliaSysimage.so", precompile_execution_file="precompile.jl", cpu_target="see below")

Compiling on either of the two machines gives basically mirrored outcomes. Any of these target specifications (A=cascadelake, B=znver2)

A
B
A;B
B;A

with any combination of adding clone_all works on the machine it was compiled on, but if I “cross-compile”, it only works on the target machine if the cpu_target is the first one in the list. E.g. compiling on machine A with B;A would work on both machines, but compiling on machine A with A;B only works on machine A, and vice versa. The error happens at startup of the Julia session with the respective sysimage and is either an “illegal instruction” or Unable to find compatible target in system image.

So (as is also mentioned in the docs) the order of the CPU targets clearly matters. What’s not clear to me is why adding the correct CPU target in a later position does not seem to impact the sysimage generation, with or without clone_all. I’m aware that adding generic as the default target in first position would work, but I would still like to understand the observation described above (about switching the CPU targets and observing different results).

simsurace · September 4, 2023, 7:30am

Maybe this recent issue is instructive (the actual target may be subtly different from what you believe). Issue 50102

The generic target provides a fallback; it doesn’t hurt to always include it, whatever else one may be targeting.

Sevi · September 4, 2023, 11:56am

Thanks, that’s definitely helpful! I didn’t see this issue before. To be honest, I still don’t really understand what’s going on, but at this point, I think I have sunk enough of everyone’s time into this.

Yeah that’s what I went for now (generic followed by a list of the possible architectures I found on the cluster).
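A minimal sketch of how such a value could be assembled, following the generic;...,clone_all pattern from jishnub’s post above. The helper name build_cpu_target is made up for this example, and the two CPU names are just the ones from this thread; substitute whatever your cluster actually reports.

```shell
# Sketch: build a JULIA_CPU_TARGET value with a generic fallback first,
# followed by one "<cpu>,clone_all" entry per CPU name given as an argument.
# build_cpu_target is a hypothetical helper, not part of Julia.
build_cpu_target() {
    target="generic"
    for cpu in "$@"; do
        target="${target};${cpu},clone_all"
    done
    printf '%s\n' "$target"
}

# Example with the two architectures from this thread (assumed list).
JULIA_CPU_TARGET="$(build_cpu_target cascadelake znver2)"
export JULIA_CPU_TARGET
echo "$JULIA_CPU_TARGET"
```

The variable has to be set in the session that precompiles the packages; caches that were already built with a different setting are not changed by exporting it afterwards.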

roi.holtzman · October 16, 2023, 9:04am

I am now battling a similar issue. Can you please share how you found out the architecture you have on your cluster?

hijit · October 16, 2023, 9:36am

Thanks for posting this. I just happen to be trying to get my code to run on an HPC cluster with multiple architectures, and I would surely have run into this issue if I hadn’t seen your post, so just two thumbs up from me!

lmtzx9h4qqnt · October 16, 2023, 2:02pm

julia -e 'println(Sys.CPU_NAME)'?

roi.holtzman · October 16, 2023, 2:18pm

Yes, that works!

Sevi · October 16, 2023, 7:36pm

Yes, exactly this! I extracted all the machines’ hostnames in the cluster with (we’re using Slurm)

sinfo -ho %n

and looped over them, running the above command through SSH. It’s not particularly nice, but I couldn’t figure out a better (in particular, quicker) solution… all my ideas to do this “properly” involved requesting an allocation for every node through Slurm, but some of them are quite busy and others are set to only accept jobs requesting more than a certain memory threshold, so it’s hard to say how long this would have taken.
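The loop described above could look roughly like the following. This is only a sketch: it assumes password-less SSH to every node and that julia is on the PATH there. The function name collect_cpu_names and the REMOTE_SHELL override are made up for this example.

```shell
# Sketch: print the Julia CPU name of every host listed on stdin (one per line).
# REMOTE_SHELL is a hypothetical override (default: ssh), handy for a dry run.
collect_cpu_names() {
    while IFS= read -r host; do
        # Skip empty lines in the host list.
        [ -n "$host" ] || continue
        # stdin is redirected so the remote command cannot consume the host list.
        "${REMOTE_SHELL:-ssh}" "$host" "julia -e 'println(Sys.CPU_NAME)'" < /dev/null
    done
}

# On a Slurm cluster (assumption: sinfo is available on the login node):
#   sinfo -ho %n | sort -u | collect_cpu_names | sort | uniq -c
```

Piping the result through sort | uniq -c gives one line per distinct architecture, which is the list needed for JULIA_CPU_TARGET.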

roi.holtzman · October 17, 2023, 7:22am

Thanks for the information!
We have an LSF cluster, and I could not find how to get this information from the documentation. I will keep looking for it. Thanks!
