Compilation options for Downfall mitigation

mmesiti · October 11, 2023, 11:21am

Hello everybody,
the code I am working on has suffered from a huge performance hit (up to 50%) because of the microcode update to mitigate the Downfall vulnerability. The generated code was using a lot of gather instructions in vector registers, apparently.

For clang, one can try -Xclang -target-feature -Xclang +prefer-no-gather, or -mno-gather, to make sure it avoids using gather instructions.

Is there a way to affect the JIT compilation in Julia in a similar way?

Salmon · October 19, 2023, 2:14pm

Maybe we can ask the experts:
@Elrod ?

Elrod · October 22, 2023, 5:56pm

mitigations=off =P

Kidding.
You can try julia -Cnative,-fast-gather. However, I still saw vpgatherqq instructions in the test that I tried.
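A quick way to check whether gather instructions still show up is to dump the native code for a hot function and search it for "gather". This is only a minimal sketch: indexed_sum is a hypothetical stand-in for the real kernel, and whether LLVM emits a gather for it depends on the CPU target and Julia version.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Hypothetical kernel (not from the original code): indexed loads such as
# x[idx[i]] are a typical candidate for vectorization with gathers.
function indexed_sum(x::Vector{Float64}, idx::Vector{Int})
    s = 0.0
    @inbounds @simd for i in eachindex(idx)
        s += x[idx[i]]
    end
    return s
end

# Capture the native assembly as a string instead of printing it.
asm = sprint() do io
    code_native(io, indexed_sum, (Vector{Float64}, Vector{Int}); debuginfo=:none)
end

# Keep only the lines that mention a gather instruction (e.g. vpgatherqq).
gather_lines = filter(l -> occursin("gather", lowercase(l)), split(asm, '\n'))
println(isempty(gather_lines) ? "no gather instructions found" : join(gather_lines, '\n'))
```

Running the same script once with plain julia and once with julia -Cnative,-fast-gather shows whether the flag changes anything for that function.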

When Julia eventually upgrades to LLVM 17, you should be able to use julia -Cnative,+prefer-no-gather,+prefer-no-scatter.
I think the best place to look for these options is the .td files:

https://github.com/llvm/llvm-project/blob/release/17.x/llvm/lib/Target/X86/X86.td


vchuravy · October 22, 2023, 6:25pm

vchuravy@odin ~> julia -Chelp
The latest version of Julia in the `alpha` channel is 1.10.0-beta3+0.x64.linux.gnu. You currently have `1.10.0-beta2+0.x64.linux.gnu` installed. Run:

  juliaup update

to install Julia 1.10.0-beta3+0.x64.linux.gnu and update the `alpha` channel to that version.
Available CPUs for this target:

  alderlake      - Select the alderlake processor.
  amdfam10       - Select the amdfam10 processor.
  athlon         - Select the athlon processor.
  athlon-4       - Select the athlon-4 processor.
  athlon-fx      - Select the athlon-fx processor.
  athlon-mp      - Select the athlon-mp processor.
  athlon-tbird   - Select the athlon-tbird processor.
  athlon-xp      - Select the athlon-xp processor.
  athlon64       - Select the athlon64 processor.
  athlon64-sse3  - Select the athlon64-sse3 processor.
  atom           - Select the atom processor.
  barcelona      - Select the barcelona processor.
  bdver1         - Select the bdver1 processor.
  bdver2         - Select the bdver2 processor.
  bdver3         - Select the bdver3 processor.
  bdver4         - Select the bdver4 processor.
  bonnell        - Select the bonnell processor.
  broadwell      - Select the broadwell processor.
  btver1         - Select the btver1 processor.
  btver2         - Select the btver2 processor.
  c3             - Select the c3 processor.
  c3-2           - Select the c3-2 processor.
  cannonlake     - Select the cannonlake processor.
  cascadelake    - Select the cascadelake processor.
  cooperlake     - Select the cooperlake processor.
  core-avx-i     - Select the core-avx-i processor.
  core-avx2      - Select the core-avx2 processor.
  core2          - Select the core2 processor.
  corei7         - Select the corei7 processor.
  corei7-avx     - Select the corei7-avx processor.
  generic        - Select the generic processor.
  geode          - Select the geode processor.
  goldmont       - Select the goldmont processor.
  goldmont-plus  - Select the goldmont-plus processor.
  haswell        - Select the haswell processor.
  i386           - Select the i386 processor.
  i486           - Select the i486 processor.
  i586           - Select the i586 processor.
  i686           - Select the i686 processor.
  icelake-client - Select the icelake-client processor.
  icelake-server - Select the icelake-server processor.
  ivybridge      - Select the ivybridge processor.
  k6             - Select the k6 processor.
  k6-2           - Select the k6-2 processor.
  k6-3           - Select the k6-3 processor.
  k8             - Select the k8 processor.
  k8-sse3        - Select the k8-sse3 processor.
  knl            - Select the knl processor.
  knm            - Select the knm processor.
  lakemont       - Select the lakemont processor.
  nehalem        - Select the nehalem processor.
  nocona         - Select the nocona processor.
  opteron        - Select the opteron processor.
  opteron-sse3   - Select the opteron-sse3 processor.
  penryn         - Select the penryn processor.
  pentium        - Select the pentium processor.
  pentium-m      - Select the pentium-m processor.
  pentium-mmx    - Select the pentium-mmx processor.
  pentium2       - Select the pentium2 processor.
  pentium3       - Select the pentium3 processor.
  pentium3m      - Select the pentium3m processor.
  pentium4       - Select the pentium4 processor.
  pentium4m      - Select the pentium4m processor.
  pentiumpro     - Select the pentiumpro processor.
  prescott       - Select the prescott processor.
  rocketlake     - Select the rocketlake processor.
  sandybridge    - Select the sandybridge processor.
  sapphirerapids - Select the sapphirerapids processor.
  silvermont     - Select the silvermont processor.
  skx            - Select the skx processor.
  skylake        - Select the skylake processor.
  skylake-avx512 - Select the skylake-avx512 processor.
  slm            - Select the slm processor.
  tigerlake      - Select the tigerlake processor.
  tremont        - Select the tremont processor.
  westmere       - Select the westmere processor.
  winchip-c6     - Select the winchip-c6 processor.
  winchip2       - Select the winchip2 processor.
  x86-64         - Select the x86-64 processor.
  x86-64-v2      - Select the x86-64-v2 processor.
  x86-64-v3      - Select the x86-64-v3 processor.
  x86-64-v4      - Select the x86-64-v4 processor.
  yonah          - Select the yonah processor.
  znver1         - Select the znver1 processor.
  znver2         - Select the znver2 processor.
  znver3         - Select the znver3 processor.

Available features for this target:

  16bit-mode                      - 16-bit mode (i8086).
  32bit-mode                      - 32-bit mode (80386).
  3dnow                           - Enable 3DNow! instructions.
  3dnowa                          - Enable 3DNow! Athlon instructions.
  64bit                           - Support 64-bit instructions.
  64bit-mode                      - 64-bit mode (x86_64).
  adx                             - Support ADX instructions.
  aes                             - Enable AES instructions.
  amx-bf16                        - Support AMX-BF16 instructions.
  amx-int8                        - Support AMX-INT8 instructions.
  amx-tile                        - Support AMX-TILE instructions.
  avx                             - Enable AVX instructions.
  avx2                            - Enable AVX2 instructions.
  avx512bf16                      - Support bfloat16 floating point.
  avx512bitalg                    - Enable AVX-512 Bit Algorithms.
  avx512bw                        - Enable AVX-512 Byte and Word Instructions.
  avx512cd                        - Enable AVX-512 Conflict Detection Instructions.
  avx512dq                        - Enable AVX-512 Doubleword and Quadword Instructions.
  avx512er                        - Enable AVX-512 Exponential and Reciprocal Instructions.
  avx512f                         - Enable AVX-512 instructions.
  avx512fp16                      - Support 16-bit floating point.
  avx512ifma                      - Enable AVX-512 Integer Fused Multiple-Add.
  avx512pf                        - Enable AVX-512 PreFetch Instructions.
  avx512vbmi                      - Enable AVX-512 Vector Byte Manipulation Instructions.
  avx512vbmi2                     - Enable AVX-512 further Vector Byte Manipulation Instructions.
  avx512vl                        - Enable AVX-512 Vector Length eXtensions.
  avx512vnni                      - Enable AVX-512 Vector Neural Network Instructions.
  avx512vp2intersect              - Enable AVX-512 vp2intersect.
  avx512vpopcntdq                 - Enable AVX-512 Population Count Instructions.
  avxvnni                         - Support AVX_VNNI encoding.
  bmi                             - Support BMI instructions.
  bmi2                            - Support BMI2 instructions.
  branchfusion                    - CMP/TEST can be fused with conditional branches.
  cldemote                        - Enable Cache Line Demote.
  clflushopt                      - Flush A Cache Line Optimized.
  clwb                            - Cache Line Write Back.
  clzero                          - Enable Cache Line Zero.
  cmov                            - Enable conditional move instructions.
  crc32                           - Enable SSE 4.2 CRC32 instruction (used when SSE4.2 is supported but function is GPR only).
  cx16                            - 64-bit with cmpxchg16b (this is true for most x86-64 chips, but not the first AMD chips).
  cx8                             - Support CMPXCHG8B instructions.
  enqcmd                          - Has ENQCMD instructions.
  ermsb                           - REP MOVS/STOS are fast.
  f16c                            - Support 16-bit floating point conversion instructions.
  false-deps-getmant              - VGETMANTSS/SD/SH and VGETMANDPS/PD(memory version) has a false dependency on dest register.
  false-deps-lzcnt-tzcnt          - LZCNT/TZCNT have a false dependency on dest register.
  false-deps-mulc                 - VF[C]MULCPH/SH has a false dependency on dest register.
  false-deps-mullq                - VPMULLQ has a false dependency on dest register.
  false-deps-perm                 - VPERMD/Q/PS/PD has a false dependency on dest register.
  false-deps-popcnt               - POPCNT has a false dependency on dest register.
  false-deps-range                - VRANGEPD/PS/SD/SS has a false dependency on dest register.
  fast-11bytenop                  - Target can quickly decode up to 11 byte NOPs.
  fast-15bytenop                  - Target can quickly decode up to 15 byte NOPs.
  fast-7bytenop                   - Target can quickly decode up to 7 byte NOPs.
  fast-bextr                      - Indicates that the BEXTR instruction is implemented as a single uop with good throughput.
  fast-gather                     - Indicates if gather is reasonably fast (this is true for Skylake client and all AVX-512 CPUs).
  fast-hops                       - Prefer horizontal vector math instructions (haddp, phsub, etc.) over normal vector instructions with shuffles.
  fast-lzcnt                      - LZCNT instructions are as fast as most simple integer ops.
  fast-movbe                      - Prefer a movbe over a single-use load + bswap / single-use bswap + store.
  fast-scalar-fsqrt               - Scalar SQRT is fast (disable Newton-Raphson).
  fast-scalar-shift-masks         - Prefer a left/right scalar logical shift pair over a shift+and pair.
  fast-shld-rotate                - SHLD can be used as a faster rotate.
  fast-variable-crosslane-shuffle - Cross-lane shuffles with variable masks are fast.
  fast-variable-perlane-shuffle   - Per-lane shuffles with variable masks are fast.
  fast-vector-fsqrt               - Vector SQRT is fast (disable Newton-Raphson).
  fast-vector-shift-masks         - Prefer a left/right vector logical shift pair over a shift+and pair.
  fma                             - Enable three-operand fused multiple-add.
  fma4                            - Enable four-operand fused multiple-add.
  fsgsbase                        - Support FS/GS Base instructions.
  fsrm                            - REP MOVSB of short lengths is faster.
  fxsr                            - Support fxsave/fxrestore instructions.
  gfni                            - Enable Galois Field Arithmetic Instructions.
  harden-sls-ijmp                 - Harden against straight line speculation across indirect JMP instructions..
  harden-sls-ret                  - Harden against straight line speculation across RET instructions..
  hreset                          - Has hreset instruction.
  idivl-to-divb                   - Use 8-bit divide for positive values less than 256.
  idivq-to-divl                   - Use 32-bit divide for positive values less than 2^32.
  invpcid                         - Invalidate Process-Context Identifier.
  kl                              - Support Key Locker kl Instructions.
  lea-sp                          - Use LEA for adjusting the stack pointer (this is an optimization for Intel Atom processors).
  lea-uses-ag                     - LEA instruction needs inputs at AG stage.
  lvi-cfi                         - Prevent indirect calls/branches from using a memory operand, and precede all indirect calls/branches from a register with an LFENCE instruction to serialize control flow. Also decompose RET instructions into a POP+LFENCE+JMP sequence..
  lvi-load-hardening              - Insert LFENCE instructions to prevent data speculatively injected into loads from being used maliciously..
  lwp                             - Enable LWP instructions.
  lzcnt                           - Support LZCNT instruction.
  macrofusion                     - Various instructions can be fused with conditional branches.
  mmx                             - Enable MMX instructions.
  movbe                           - Support MOVBE instruction.
  movdir64b                       - Support movdir64b instruction (direct store 64 bytes).
  movdiri                         - Support movdiri instruction (direct store integer).
  mwaitx                          - Enable MONITORX/MWAITX timer functionality.
  nopl                            - Enable NOPL instruction (generally pentium pro+).
  pad-short-functions             - Pad short functions (to prevent a stall when returning too early).
  pclmul                          - Enable packed carry-less multiplication instructions.
  pconfig                         - platform configuration instruction.
  pku                             - Enable protection keys.
  popcnt                          - Support POPCNT instruction.
  prefer-128-bit                  - Prefer 128-bit AVX instructions.
  prefer-256-bit                  - Prefer 256-bit AVX instructions.
  prefer-mask-registers           - Prefer AVX512 mask registers over PTEST/MOVMSK.
  prefetchwt1                     - Prefetch with Intent to Write and T1 Hint.
  prfchw                          - Support PRFCHW instructions.
  ptwrite                         - Support ptwrite instruction.
  rdpid                           - Support RDPID instructions.
  rdpru                           - Support RDPRU instructions.
  rdrnd                           - Support RDRAND instruction.
  rdseed                          - Support RDSEED instruction.
  retpoline                       - Remove speculation of indirect branches from the generated code, either by avoiding them entirely or lowering them with a speculation blocking construct.
  retpoline-external-thunk        - When lowering an indirect call or branch using a `retpoline`, rely on the specified user provided thunk rather than emitting one ourselves. Only has effect when combined with some other retpoline feature.
  retpoline-indirect-branches     - Remove speculation of indirect branches from the generated code.
  retpoline-indirect-calls        - Remove speculation of indirect calls from the generated code.
  rtm                             - Support RTM instructions.
  sahf                            - Support LAHF and SAHF instructions in 64-bit mode.
  sbb-dep-breaking                - SBB with same register has no source dependency.
  serialize                       - Has serialize instruction.
  seses                           - Prevent speculative execution side channel timing attacks by inserting a speculation barrier before memory reads, memory writes, and conditional branches. Implies LVI Control Flow integrity..
  sgx                             - Enable Software Guard Extensions.
  sha                             - Enable SHA instructions.
  shstk                           - Support CET Shadow-Stack instructions.
  slow-3ops-lea                   - LEA instruction with 3 ops or certain registers is slow.
  slow-incdec                     - INC and DEC instructions are slower than ADD and SUB.
  slow-lea                        - LEA instruction with certain arguments is slow.
  slow-pmaddwd                    - PMADDWD is slower than PMULLD.
  slow-pmulld                     - PMULLD instruction is slow (compared to PMULLW/PMULHW and PMULUDQ).
  slow-shld                       - SHLD instruction is slow.
  slow-two-mem-ops                - Two memory operand instructions are slow.
  slow-unaligned-mem-16           - Slow unaligned 16-byte memory access.
  slow-unaligned-mem-32           - Slow unaligned 32-byte memory access.
  soft-float                      - Use software floating point features.
  sse                             - Enable SSE instructions.
  sse-unaligned-mem               - Allow unaligned memory operands with SSE instructions (this may require setting a configuration bit in the processor).
  sse2                            - Enable SSE2 instructions.
  sse3                            - Enable SSE3 instructions.
  sse4.1                          - Enable SSE 4.1 instructions.
  sse4.2                          - Enable SSE 4.2 instructions.
  sse4a                           - Support SSE 4a instructions.
  ssse3                           - Enable SSSE3 instructions.
  tagged-globals                  - Use an instruction sequence for taking the address of a global that allows a memory tag in the upper address bits..
  tbm                             - Enable TBM instructions.
  tsxldtrk                        - Support TSXLDTRK instructions.
  uintr                           - Has UINTR Instructions.
  use-glm-div-sqrt-costs          - Use Goldmont specific floating point div/sqrt costs.
  use-slm-arith-costs             - Use Silvermont specific arithmetic costs.
  vaes                            - Promote selected AES instructions to AVX512/AVX registers.
  vpclmulqdq                      - Enable vpclmulqdq instructions.
  vzeroupper                      - Should insert vzeroupper instructions.
  waitpkg                         - Wait and pause enhancements.
  wbnoinvd                        - Write Back No Invalidate.
  widekl                          - Support Key Locker wide Instructions.
  x87                             - Enable X87 float instructions.
  xop                             - Enable XOP instructions.
  xsave                           - Support xsave instructions.
  xsavec                          - Support xsavec instructions.
  xsaveopt                        - Support xsaveopt instructions.
  xsaves                          - Support xsaves instructions.

Use +feature to enable a feature, or -feature to disable it.
For example, llc -mcpu=mycpu -mattr=+feature1,-feature2

mmesiti · October 25, 2023, 9:21am

Oh I see, I overlooked the -C option and -Chelp. Thank you for these kind answers instead of RTFM.

This looks a little misleading:

Use +feature to enable a feature, or -feature to disable it.
For example, llc -mcpu=mycpu -mattr=+feature1,-feature2

But I guess it’s hard to fix since it comes from https://github.com/llvm/llvm-project/blob/release/17.x/llvm/lib/MC/MCSubtargetInfo.cpp#L123, right?

In any case, as expected I do not see any improvement compared to launching Julia without -Cnative,-fast-gather. I will wait for LLVM 17 and rerun the benchmark then.
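To know when it is worth rerunning, one can ask the running Julia which LLVM it was built with. A small sketch, assuming that Base.libllvm_version and Base.JLOptions() behave as in current releases (both are internals rather than documented stable API):

```julia
# LLVM version linked into this Julia build.
llvm = Base.libllvm_version
println("LLVM version: ", llvm)

# CPU target string as seen by the runtime (set with -C / --cpu-target).
opts = Base.JLOptions()
target = opts.cpu_target == C_NULL ? "(default)" : unsafe_string(opts.cpu_target)
println("cpu target: ", target)

# The prefer-no-gather / prefer-no-scatter features are expected with LLVM 17.
if llvm >= v"17"
    println("try: julia -Cnative,+prefer-no-gather,+prefer-no-scatter")
else
    println("LLVM is older than 17; check `julia -Chelp` for the available features")
end
```

The feature list printed by julia -Chelp remains the authoritative check, since it reflects what the bundled LLVM actually accepts.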
