Hi! I’m using Julia 1.9 on an HPC cluster and ran into a strange error on some of the machines. I have a hunch that it’s related to native code caching and the fact that I currently use the same depot for all machines, but I’m a bit stuck investigating the actual issue and finding a good solution.
Problem
Running certain code on the affected machines gives an Illegal instruction error. So far I could only trigger the error when using Cthulhu.jl, but I’m not sure if it’s really related to that package itself (see below for the steps to reproduce and the machine specs).
- When I first precompile on machine A and then run the example on machine B, Julia crashes with an Illegal instruction error.
Error details
julia> @descend sort([5,4,3])
Invalid instruction at 0x14f38d0600f2: 0xc5, 0xfc, 0x46, 0xc8, 0xc5, 0xf1, 0xef, 0xc9, 0xc5, 0xf9, 0x6f, 0x05, 0x2e, 0x3d, 0xec
[2951144] signal (4.2): Illegal instruction
in expression starting at REPL[2]:1
iterate at ./range.jl:887 [inlined]
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:110
abstract_call_known at ./compiler/abstractinterpretation.jl:1949
abstract_call at ./compiler/abstractinterpretation.jl:2020
abstract_call at ./compiler/abstractinterpretation.jl:1999
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2183
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2396
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:2682
typeinf_local at ./compiler/abstractinterpretation.jl:2867
typeinf_nocycle at ./compiler/abstractinterpretation.jl:2955
_typeinf at ./compiler/typeinfer.jl:246
typeinf at ./compiler/typeinfer.jl:219 [inlined]
...
- When I do it in reverse (precompile on B and run on A), the example runs on both machines without issues. Of course, the problem might still show up in some other case that I haven’t found so far.
- If I do the same on Julia 1.8.5 (same version of Cthulhu), it works both ways, and there is also a noticeable lag when running @descend sort([5,4,3]) for the first time on both machines, which seems to indicate that native code is compiled on each machine independently (instead of being precompiled on a single machine).
Steps to reproduce
I’m testing this on two different machines at the moment, which consistently reproduces the error. These are the specs:
Machine A
versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
Threads: 1 on 16 virtual cores
Environment:
...
Machine B
versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 96 × AMD EPYC 7352 24-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
Threads: 1 on 96 virtual cores
Environment:
...
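To compare the two machines without reading the full versioninfo() output, a minimal sketch like the following could be run on each host. It only relies on Sys.CPU_NAME, which is the name shown in the LLVM line above (cascadelake on machine A, znver2 on machine B); the helper name machine_summary is made up for illustration.

```julia
# Hypothetical helper: collect the facts that matter for native code caching.
function machine_summary()
    return (host = gethostname(),      # name of the machine we are running on
            cpu = Sys.CPU_NAME,        # CPU name detected by Julia's LLVM backend
            julia = string(VERSION))   # Julia version used on this machine
end

println(machine_summary())
```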
To reproduce the error, I use an empty Julia depot. These are the steps I’m testing:
- Start REPL with an empty depot on one machine.
- Run ] add Cthulhu to let it build the depot and precompile (the version is v2.9.0).
- Run using Cthulhu; @descend sort([5,4,3]) (just an example).
- Start Julia with the same (non-empty) depot on the other machine and run using Cthulhu; @descend sort([5,4,3]) again to potentially trigger the error.
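The first two steps can also be sketched as a script, which makes it easier to repeat the test with a fresh depot. This is only a sketch under assumptions: it assumes the JULIA_DEPOT_PATH environment variable was set to the shared, initially empty depot before Julia was started, and the helper name prepare_depot is made up for illustration.

```julia
using Pkg

# Assumption: JULIA_DEPOT_PATH pointed to the shared depot before this Julia
# process was started, so the first entry of DEPOT_PATH is that depot.
println("Using depot: ", first(DEPOT_PATH))

# Hypothetical helper mirroring the install step: add the Cthulhu version
# mentioned above and precompile it on the current machine.
function prepare_depot(; name = "Cthulhu", version = "2.9.0")
    Pkg.add(name = name, version = version)
    Pkg.precompile()
end
```

Calling prepare_depot() on the first machine fills the depot. The @descend call itself is interactive, so I still run using Cthulhu; @descend sort([5,4,3]) by hand in the REPL on both machines.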
Questions
- It feels a lot like this is related to native code caching and that I’m not supposed to use the same depot for different machines with different processors. Is this generally true? In Julia 1.8 I had no problems with the current setup (using one depot for different machines). Using a separate depot for each type of machine would cause a lot of overhead.
- I couldn’t reproduce the illegal instruction error with other code so far, so I’m wondering if this is related to Cthulhu specifically? I’m fine with not using Cthulhu, since I don’t really need it in that environment. I caught the issue more by accident, since it was still installed, but it made me a bit paranoid about more errors later during operation.
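To make the "separate depot per machine type" idea from the first question concrete, here is a minimal sketch of how a depot directory could be derived from the CPU name. The base path /shared/julia and the helper name depot_for_cpu are hypothetical, and the resulting path would still have to be passed to Julia via JULIA_DEPOT_PATH before startup. It does not address the overhead of installing and precompiling everything once per CPU type.

```julia
# Hypothetical helper: one depot directory per CPU name as reported by Julia.
function depot_for_cpu(base::AbstractString, cpu::AbstractString = Sys.CPU_NAME)
    return joinpath(base, "depot-" * cpu)
end

# Path for the machine this runs on, plus the two CPU names from the specs above.
println(depot_for_cpu("/shared/julia"))
println(depot_for_cpu("/shared/julia", "cascadelake"))
println(depot_for_cpu("/shared/julia", "znver2"))
```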
Any hints as to what is actually going on and what would be a good way to deal with the issue are greatly appreciated.
PS: The original illegal instruction was a different one and appeared while precompiling Cthulhu on machine B. However, that run used an old depot that already had content in it, and I can’t reproduce the same opcode in a fresh depot.
Different illegal opcode from precompilation
Invalid instruction at 0x15487a71163b: 0x62, 0xd3, 0xfd, 0x08, 0x1f, 0x06, 0x04, 0xc5, 0xf9, 0x98, 0xc0, 0x75, 0x2a, 0x48, 0x8b