Error precompiling on cluster

I am running a simple script on a cluster, launching many instances of the same script.

I get the following error:

┌ Warning: The call to compilecache failed to create a usable precompiled cache file for FileIO [5789e2e9-d7fb-5bc7-8068-2c6fae9b9549]
│   exception = ArgumentError: Invalid checksum in cache file /home/labs/orenraz/roiho/.julia/compiled/v1.9/FileIO/6iKRU_gncnE.so.
└ @ Base loading.jl:1818

Followed by

└ @ Base loading.jl:1793
┌ Warning: Module FileIO with build ID ffffffff-ffff-ffff-001c-a130a3b6bf6f is missing from the cache.
│ This may mean FileIO [5789e2e9-d7fb-5bc7-8068-2c6fae9b9549] does not support precompilation but is imported by a module that does.

My script was working on Julia 1.7.2, but I just upgraded to 1.9.3, and it stopped working. Does anyone know why?

Julia 1.9 caches native compiled code; Julia 1.7.2 did not. However, this is only a warning, so I’m assuming your code is still working? If not, please share more details.

Thanks for replying!

I think you might be right and that the code may still be running well. I am checking this now and will update.

@carstenbauer It seems that every time I open an SSH session with the cluster, I need to precompile my project again. Is that expected in Julia 1.9? Is there a way to avoid it?

I have had a lot of difficulty getting things to work on a cluster as well. At this point I just compile on one node first and hope for the best.
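
For readers who want to try the "compile on one node first" approach, here is a minimal sketch. It assumes the project has a Project.toml in a known directory; the names JULIA_BIN and precompile_project are made up for this example and are not Julia features. It is not a guaranteed fix for the warnings above, only a way to precompile once in a single process before submitting many jobs.

```shell
# Assumption: JULIA_BIN is an illustrative variable for the Julia executable.
JULIA_BIN="${JULIA_BIN:-julia}"

# Hypothetical helper: precompile all packages of the project in directory $1
# in one Julia process, using the real Pkg.precompile() call.
precompile_project() {
    "$JULIA_BIN" --project="$1" -e 'using Pkg; Pkg.precompile()'
}

# Usage (run once, then submit the jobs):
#   precompile_project /path/to/project
```

Running this on a node of the same type as the compute nodes keeps the setup closest to what is described in the post above.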

Getting the local (login node) cache to be reused seems impossible now.

I guess I’m lucky that our login nodes have the same CPU structure as our compute nodes.

What happens if you set a proper JULIA_CPU_TARGET? Do you still have issues?

I have set it to generic on both the login and compute nodes and still have issues.

Our login and compute nodes are almost identical except for the NUMA configuration, which shouldn’t change cache validity, but it does.

@carstenbauer @jling thank you both.

I precompiled the package in my ssh session and ran the scripts, but I still get the same error:

/scratch/1697361899.830137.65: line 8: 58683 Segmentation fault      (core dumped)

It seems from what you are saying that Julia 1.9 cannot be run on the cluster properly. So I guess the only resort is to go back to Julia 1.7, right?

You say “the same”, but you didn’t mention a segfault before, only some precompilation warnings.

Well, for me it just runs fine on various HPC clusters.

Oops, you are right. I was sure that I had. Sorry.

Could you please clarify this comment? What do you suggest I try here? What does setting this variable properly mean?

You need to set something like

export JULIA_CPU_TARGET="generic;skylake-avx512,clone_all;znver2,clone_all"

where you may need to replace the different architectures as appropriate. Please do this before you start Julia. This will lead to the compiled images being compatible with different CPU instruction set architectures.
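
As a minimal sketch of how such a value is put together, the helper below assembles the string from a list of architecture names. The function name build_cpu_target is made up for illustration; the architecture names passed to it are the ones from the example above and would need to be replaced with those of the actual cluster.

```shell
# Hypothetical helper (not part of Julia): builds a JULIA_CPU_TARGET value
# with "generic" as the fallback plus one "<arch>,clone_all" entry per argument.
build_cpu_target() {
    target="generic"
    for arch in "$@"; do
        target="${target};${arch},clone_all"
    done
    printf '%s\n' "$target"
}

# Set the variable before starting Julia, e.g. at the top of a job script.
JULIA_CPU_TARGET="$(build_cpu_target skylake-avx512 znver2)"
export JULIA_CPU_TARGET
echo "$JULIA_CPU_TARGET"   # prints generic;skylake-avx512,clone_all;znver2,clone_all
```

Keeping "generic" first means there is always a fallback target for a node whose CPU is not in the list.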

Thanks very much!
I am looking at the documentation, but I am not sure how to find out which settings are appropriate for my cluster. Do you know if there is some guidance somewhere for this?

I think you need to check what CPU is being used on the cluster nodes, and set the architecture accordingly. The list of accepted architectures may be found using julia -C help.

If you want to do it from Julia

using CpuId
cpuinfo()

is a good start. (CpuId.jl)

Thank you.
Though I do not know what to do with the information I got:

julia> cpuinfo()
  Cpu Property       Value
  –––––––––––––––––– ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
  Brand              Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz
  Vendor             :Intel
  Architecture       :UnknownIntel
  Model              Family: 0x06, Model: 0x6a, Stepping: 0x06, Type: 0x00
  Cores              26 physical cores, 52 logical cores (on executing CPU)
                     Hyperthreading hardware capability detected
  Clock Frequencies  2200 / 3400 MHz (base/max), 100 MHz bus
  Data Cache         Level 1:3 : (48, 1280, 39936) kbytes
                     64 byte cache line size
  Address Size       57 bits virtual, 46 bits physical
  SIMD               512 bit = 64 byte max. SIMD vector size
  Time Stamp Counter TSC is accessible via `rdtsc`
                     TSC runs at constant rate (invariant from clock frequency)
  Perf. Monitoring   Performance Monitoring Counters (PMC) revision 5
                     Available hardware counters per logical core:
                     4 fixed-function counters of 48 bit width
                     8 general-purpose counters of 48 bit width
  Hypervisor         No

I can’t find anything that matches it here:

$ julia -C help
Available CPUs for this target:

  alderlake      - Select the alderlake processor.
  amdfam10       - Select the amdfam10 processor.
  athlon         - Select the athlon processor.
  athlon-4       - Select the athlon-4 processor.
  athlon-fx      - Select the athlon-fx processor.
  athlon-mp      - Select the athlon-mp processor.
  athlon-tbird   - Select the athlon-tbird processor.
  athlon-xp      - Select the athlon-xp processor.
  athlon64       - Select the athlon64 processor.
  athlon64-sse3  - Select the athlon64-sse3 processor.
  atom           - Select the atom processor.
  barcelona      - Select the barcelona processor.
  bdver1         - Select the bdver1 processor.
  bdver2         - Select the bdver2 processor.
  bdver3         - Select the bdver3 processor.
  bdver4         - Select the bdver4 processor.
  bonnell        - Select the bonnell processor.
  broadwell      - Select the broadwell processor.
  btver1         - Select the btver1 processor.
  btver2         - Select the btver2 processor.
  c3             - Select the c3 processor.
  c3-2           - Select the c3-2 processor.
  cannonlake     - Select the cannonlake processor.
  cascadelake    - Select the cascadelake processor.
  cooperlake     - Select the cooperlake processor.
  core-avx-i     - Select the core-avx-i processor.
  core-avx2      - Select the core-avx2 processor.
  core2          - Select the core2 processor.
  corei7         - Select the corei7 processor.
  corei7-avx     - Select the corei7-avx processor.
  generic        - Select the generic processor.
  geode          - Select the geode processor.
  goldmont       - Select the goldmont processor.
  goldmont-plus  - Select the goldmont-plus processor.
  haswell        - Select the haswell processor.
  i386           - Select the i386 processor.
  i486           - Select the i486 processor.
  i586           - Select the i586 processor.
  i686           - Select the i686 processor.
  icelake-client - Select the icelake-client processor.
  icelake-server - Select the icelake-server processor.
  ivybridge      - Select the ivybridge processor.
  k6             - Select the k6 processor.
  k6-2           - Select the k6-2 processor.
  k6-3           - Select the k6-3 processor.
  k8             - Select the k8 processor.
  k8-sse3        - Select the k8-sse3 processor.
  knl            - Select the knl processor.
  knm            - Select the knm processor.
  lakemont       - Select the lakemont processor.
  nehalem        - Select the nehalem processor.
  nocona         - Select the nocona processor.
  opteron        - Select the opteron processor.
  opteron-sse3   - Select the opteron-sse3 processor.
  penryn         - Select the penryn processor.
  pentium        - Select the pentium processor.
  pentium-m      - Select the pentium-m processor.
  pentium-mmx    - Select the pentium-mmx processor.
  pentium2       - Select the pentium2 processor.
  pentium3       - Select the pentium3 processor.
  pentium3m      - Select the pentium3m processor.
  pentium4       - Select the pentium4 processor.
  pentium4m      - Select the pentium4m processor.
  pentiumpro     - Select the pentiumpro processor.
  prescott       - Select the prescott processor.
  rocketlake     - Select the rocketlake processor.
  sandybridge    - Select the sandybridge processor.
  sapphirerapids - Select the sapphirerapids processor.
  silvermont     - Select the silvermont processor.
  skx            - Select the skx processor.
  skylake        - Select the skylake processor.
  skylake-avx512 - Select the skylake-avx512 processor.
  slm            - Select the slm processor.
  tigerlake      - Select the tigerlake processor.
  tremont        - Select the tremont processor.
  westmere       - Select the westmere processor.
  winchip-c6     - Select the winchip-c6 processor.
  winchip2       - Select the winchip2 processor.
  x86-64         - Select the x86-64 processor.
  x86-64-v2      - Select the x86-64-v2 processor.
  x86-64-v3      - Select the x86-64-v3 processor.
  x86-64-v4      - Select the x86-64-v4 processor.
  yonah          - Select the yonah processor.
  znver1         - Select the znver1 processor.
  znver2         - Select the znver2 processor.
  znver3         - Select the znver3 processor.

Available features for this target:

  16bit-mode                      - 16-bit mode (i8086).
  32bit-mode                      - 32-bit mode (80386).
  3dnow                           - Enable 3DNow! instructions.
  3dnowa                          - Enable 3DNow! Athlon instructions.
  64bit                           - Support 64-bit instructions.
  64bit-mode                      - 64-bit mode (x86_64).
  adx                             - Support ADX instructions.
  aes                             - Enable AES instructions.
  amx-bf16                        - Support AMX-BF16 instructions.
  amx-int8                        - Support AMX-INT8 instructions.
  amx-tile                        - Support AMX-TILE instructions.
  avx                             - Enable AVX instructions.
  avx2                            - Enable AVX2 instructions.
  avx512bf16                      - Support bfloat16 floating point.
  avx512bitalg                    - Enable AVX-512 Bit Algorithms.
  avx512bw                        - Enable AVX-512 Byte and Word Instructions.
  avx512cd                        - Enable AVX-512 Conflict Detection Instructions.
  avx512dq                        - Enable AVX-512 Doubleword and Quadword Instructions.
  avx512er                        - Enable AVX-512 Exponential and Reciprocal Instructions.
  avx512f                         - Enable AVX-512 instructions.
  avx512fp16                      - Support 16-bit floating point.
  avx512ifma                      - Enable AVX-512 Integer Fused Multiple-Add.
  avx512pf                        - Enable AVX-512 PreFetch Instructions.
  avx512vbmi                      - Enable AVX-512 Vector Byte Manipulation Instructions.
  avx512vbmi2                     - Enable AVX-512 further Vector Byte Manipulation Instructions.
  avx512vl                        - Enable AVX-512 Vector Length eXtensions.
  avx512vnni                      - Enable AVX-512 Vector Neural Network Instructions.
  avx512vp2intersect              - Enable AVX-512 vp2intersect.
  avx512vpopcntdq                 - Enable AVX-512 Population Count Instructions.
  avxvnni                         - Support AVX_VNNI encoding.
  bmi                             - Support BMI instructions.
  bmi2                            - Support BMI2 instructions.
  branchfusion                    - CMP/TEST can be fused with conditional branches.
  cldemote                        - Enable Cache Demote.
  clflushopt                      - Flush A Cache Line Optimized.
  clwb                            - Cache Line Write Back.
  clzero                          - Enable Cache Line Zero.
  cmov                            - Enable conditional move instructions.
  crc32                           - Enable SSE 4.2 CRC32 instruction.
  cx16                            - 64-bit with cmpxchg16b.
  cx8                             - Support CMPXCHG8B instructions.
  enqcmd                          - Has ENQCMD instructions.
  ermsb                           - REP MOVS/STOS are fast.
  f16c                            - Support 16-bit floating point conversion instructions.
  false-deps-lzcnt-tzcnt          - LZCNT/TZCNT have a false dependency on dest register.
  false-deps-popcnt               - POPCNT has a false dependency on dest register.
  fast-11bytenop                  - Target can quickly decode up to 11 byte NOPs.
  fast-15bytenop                  - Target can quickly decode up to 15 byte NOPs.
  fast-7bytenop                   - Target can quickly decode up to 7 byte NOPs.
  fast-bextr                      - Indicates that the BEXTR instruction is implemented as a single uop with good throughput.
  fast-gather                     - Indicates if gather is reasonably fast.
  fast-hops                       - Prefer horizontal vector math instructions (haddp, phsub, etc.) over normal vector instructions with shuffles.
  fast-lzcnt                      - LZCNT instructions are as fast as most simple integer ops.
  fast-movbe                      - Prefer a movbe over a single-use load + bswap / single-use bswap + store.
  fast-scalar-fsqrt               - Scalar SQRT is fast (disable Newton-Raphson).
  fast-scalar-shift-masks         - Prefer a left/right scalar logical shift pair over a shift+and pair.
  fast-shld-rotate                - SHLD can be used as a faster rotate.
  fast-variable-crosslane-shuffle - Cross-lane shuffles with variable masks are fast.
  fast-variable-perlane-shuffle   - Per-lane shuffles with variable masks are fast.
  fast-vector-fsqrt               - Vector SQRT is fast (disable Newton-Raphson).
  fast-vector-shift-masks         - Prefer a left/right vector logical shift pair over a shift+and pair.
  fma                             - Enable three-operand fused multiple-add.
  fma4                            - Enable four-operand fused multiple-add.
  fsgsbase                        - Support FS/GS Base instructions.
  fsrm                            - REP MOVSB of short lengths is faster.
  fxsr                            - Support fxsave/fxrestore instructions.
  gfni                            - Enable Galois Field Arithmetic Instructions.
  hreset                          - Has hreset instruction.
  idivl-to-divb                   - Use 8-bit divide for positive values less than 256.
  idivq-to-divl                   - Use 32-bit divide for positive values less than 2^32.
  invpcid                         - Invalidate Process-Context Identifier.
  kl                              - Support Key Locker kl Instructions.
  lea-sp                          - Use LEA for adjusting the stack pointer.
  lea-uses-ag                     - LEA instruction needs inputs at AG stage.
  lvi-cfi                         - Prevent indirect calls/branches from using a memory operand, and precede all indirect calls/branches from a register with an LFENCE instruction to serialize control flow. Also decompose RET instructions into a POP+LFENCE+JMP sequence..
  lvi-load-hardening              - Insert LFENCE instructions to prevent data speculatively injected into loads from being used maliciously..
  lwp                             - Enable LWP instructions.
  lzcnt                           - Support LZCNT instruction.
  macrofusion                     - Various instructions can be fused with conditional branches.
  mmx                             - Enable MMX instructions.
  movbe                           - Support MOVBE instruction.
  movdir64b                       - Support movdir64b instruction.
  movdiri                         - Support movdiri instruction.
  mwaitx                          - Enable MONITORX/MWAITX timer functionality.
  nopl                            - Enable NOPL instruction.
  pad-short-functions             - Pad short functions.
  pclmul                          - Enable packed carry-less multiplication instructions.
  pconfig                         - platform configuration instruction.
  pku                             - Enable protection keys.
  popcnt                          - Support POPCNT instruction.
  prefer-128-bit                  - Prefer 128-bit AVX instructions.
  prefer-256-bit                  - Prefer 256-bit AVX instructions.
  prefer-mask-registers           - Prefer AVX512 mask registers over PTEST/MOVMSK.
  prefetchwt1                     - Prefetch with Intent to Write and T1 Hint.
  prfchw                          - Support PRFCHW instructions.
  ptwrite                         - Support ptwrite instruction.
  rdpid                           - Support RDPID instructions.
  rdrnd                           - Support RDRAND instruction.
  rdseed                          - Support RDSEED instruction.
  retpoline                       - Remove speculation of indirect branches from the generated code, either by avoiding them entirely or lowering them with a speculation blocking construct.
  retpoline-external-thunk        - When lowering an indirect call or branch using a `retpoline`, rely on the specified user provided thunk rather than emitting one ourselves. Only has effect when combined with some other retpoline feature.
  retpoline-indirect-branches     - Remove speculation of indirect branches from the generated code.
  retpoline-indirect-calls        - Remove speculation of indirect calls from the generated code.
  rtm                             - Support RTM instructions.
  sahf                            - Support LAHF and SAHF instructions in 64-bit mode.
  serialize                       - Has serialize instruction.
  seses                           - Prevent speculative execution side channel timing attacks by inserting a speculation barrier before memory reads, memory writes, and conditional branches. Implies LVI Control Flow integrity..
  sgx                             - Enable Software Guard Extensions.
  sha                             - Enable SHA instructions.
  shstk                           - Support CET Shadow-Stack instructions.
  slow-3ops-lea                   - LEA instruction with 3 ops or certain registers is slow.
  slow-incdec                     - INC and DEC instructions are slower than ADD and SUB.
  slow-lea                        - LEA instruction with certain arguments is slow.
  slow-pmaddwd                    - PMADDWD is slower than PMULLD.
  slow-pmulld                     - PMULLD instruction is slow.
  slow-shld                       - SHLD instruction is slow.
  slow-two-mem-ops                - Two memory operand instructions are slow.
  slow-unaligned-mem-16           - Slow unaligned 16-byte memory access.
  slow-unaligned-mem-32           - Slow unaligned 32-byte memory access.
  soft-float                      - Use software floating point features.
  sse                             - Enable SSE instructions.
  sse-unaligned-mem               - Allow unaligned memory operands with SSE instructions.
  sse2                            - Enable SSE2 instructions.
  sse3                            - Enable SSE3 instructions.
  sse4.1                          - Enable SSE 4.1 instructions.
  sse4.2                          - Enable SSE 4.2 instructions.
  sse4a                           - Support SSE 4a instructions.
  ssse3                           - Enable SSSE3 instructions.
  tagged-globals                  - Use an instruction sequence for taking the address of a global that allows a memory tag in the upper address bits..
  tbm                             - Enable TBM instructions.
  tsxldtrk                        - Support TSXLDTRK instructions.
  uintr                           - Has UINTR Instructions.
  use-aa                          - Use alias analysis during codegen.
  use-glm-div-sqrt-costs          - Use Goldmont specific floating point div/sqrt costs.
  use-slm-arith-costs             - Use Silvermont specific arithmetic costs.
  vaes                            - Promote selected AES instructions to AVX512/AVX registers.
  vpclmulqdq                      - Enable vpclmulqdq instructions.
  vzeroupper                      - Should insert vzeroupper instructions.
  waitpkg                         - Wait and pause enhancements.
  wbnoinvd                        - Write Back No Invalidate.
  widekl                          - Support Key Locker wide Instructions.
  x87                             - Enable X87 float instructions.
  xop                             - Enable XOP instructions.
  xsave                           - Support xsave instructions.
  xsavec                          - Support xsavec instructions.
  xsaveopt                        - Support xsaveopt instructions.
  xsaves                          - Support xsaves instructions.

I got some more information by running lscpu:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                52
On-line CPU(s) list:   0-51
Thread(s) per core:    1
Core(s) per socket:    26
Socket(s):             2
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 106
Model name:            Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz
Stepping:              6
CPU MHz:               2200.000
BogoMIPS:              4400.00
Virtualization:        VT-x
L1d cache:             48K
L1i cache:             32K
L2 cache:              1280K
L3 cache:              39936K
NUMA node0 CPU(s):     0-12
NUMA node1 CPU(s):     13-25
NUMA node2 CPU(s):     26-38
NUMA node3 CPU(s):     39-51
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 invpcid_single ssbd mba rsb_ctxsw ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear pconfig spec_ctrl intel_stibp flush_l1d arch_capabilities

From the documentation, it seems like this is the generic target (they write "This creates a system image with three separate targets; one for a generic x86_64 processor").
Does this mean that I need to run the following?

export JULIA_CPU_TARGET="generic"

Looks like you want icelake-server for this CPU. What other CPUs do you have on the other nodes?
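
Since the list printed by julia -C help is long, it can help to filter it by a keyword. The sketch below is only an illustration: the function name matching_cpu_names is made up, and it assumes the "name - Select the name processor." line format shown in the listing above.

```shell
# Hypothetical helper: reads `julia -C help`-style text on stdin and prints
# the CPU names (first column of the "name - Select the name processor." lines)
# that contain the substring given as $1. Feature lines are skipped.
matching_cpu_names() {
    awk -v pat="$1" '$2 == "-" && $3 == "Select" && index($1, pat) { print $1 }'
}

# Usage on a machine with Julia installed; 2>&1 is there in case the list
# is written to stderr rather than stdout:
#   julia -C help 2>&1 | matching_cpu_names icelake
```

With the listing from this thread as input, the keyword icelake would match icelake-client and icelake-server.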

Thanks! So should I run

export JULIA_CPU_TARGET="generic;icelake-server,clone_all"

or

export JULIA_CPU_TARGET="icelake-server,clone_all"

?

I am not sure how to check this for other nodes… Do you have any idea?

I think the first is better, as generic acts as a fallback. Can you log in to the compute nodes and check the CPU details? There may also be documentation for the cluster.

Thanks. I am trying to figure out how to find the CPU of the nodes. Will update you shortly.

If your cluster runs the Slurm workload manager, you should be able to log into a compute node using srun --pty /bin/bash. From there you can run lscpu, or cpuinfo() from Julia, to gather the information.
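
If an interactive session is inconvenient, lscpu can also be run through srun non-interactively and the results summarized. This is a rough sketch under assumptions: the helper name distinct_cpu_models is made up, and the partition names in the usage comment are placeholders that need to be replaced with the real ones on your cluster.

```shell
# Hypothetical helper: reads one or more lscpu outputs on stdin and prints
# each distinct "Model name" value once.
distinct_cpu_models() {
    sed -n 's/^Model name:[[:space:]]*//p' | sort -u
}

# Usage on a Slurm cluster; the partition names "short" and "long" are placeholders:
#   for p in short long; do
#       srun --partition="$p" --nodes=1 --ntasks=1 lscpu
#   done | distinct_cpu_models
```

Each distinct model that shows up is a candidate for its own entry in JULIA_CPU_TARGET.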
