Hello everybody,
the code I am working on has taken a huge performance hit (up to 50%) from the microcode update that mitigates the Downfall vulnerability. The generated code apparently used a lot of gather instructions on vector registers.
For clang, one can pass -Xclang -target-feature -Xclang +prefer-no-gather, or -mno-gather, to make the compiler avoid gather instructions.
Is there a way to affect the JIT compilation in Julia in a similar way?
Kidding.
You can try julia -Cnative,-fast-gather. However, I still saw vpgatherqq instructions in the test that I tried.
When Julia eventually upgrades to LLVM 17, you should be able to use julia -Cnative,+prefer-no-gather,+prefer-no-scatter.
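To repeat that kind of check yourself, here is a minimal sketch that dumps the native code of a function and searches it for gather mnemonics. The kernel gather_sum is a made-up example, not the code from the original question, and whether it is vectorized with gathers at all depends on your CPU, Julia version and LLVM version. Start Julia with the -C flags you want to test before running it.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Hypothetical kernel: indexed loads like x[idx[i]] are candidates for
# gather instructions when the loop gets vectorized.
function gather_sum(x::Vector{Float64}, idx::Vector{Int})
    s = 0.0
    @inbounds @simd for i in eachindex(idx)
        s += x[idx[i]]
    end
    return s
end

# Capture the native assembly as a string instead of printing it.
asm = sprint() do io
    code_native(io, gather_sum, (Vector{Float64}, Vector{Int}); debuginfo=:none)
end

# Matches mnemonics such as vgatherdpd, vgatherqpd, vpgatherdd, vpgatherqq.
has_gather = occursin(r"vp?gather", asm)
println(has_gather ? "gather instructions present" : "no gather instructions found")
```

Comparing the result between a plain julia session and one started with julia -Cnative,-fast-gather shows whether the flag changed anything for that particular function.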
I think the best place to look for these options is the .td files:
vchuravy@odin ~> julia -Chelp
Available CPUs for this target:
alderlake - Select the alderlake processor.
amdfam10 - Select the amdfam10 processor.
athlon - Select the athlon processor.
athlon-4 - Select the athlon-4 processor.
athlon-fx - Select the athlon-fx processor.
athlon-mp - Select the athlon-mp processor.
athlon-tbird - Select the athlon-tbird processor.
athlon-xp - Select the athlon-xp processor.
athlon64 - Select the athlon64 processor.
athlon64-sse3 - Select the athlon64-sse3 processor.
atom - Select the atom processor.
barcelona - Select the barcelona processor.
bdver1 - Select the bdver1 processor.
bdver2 - Select the bdver2 processor.
bdver3 - Select the bdver3 processor.
bdver4 - Select the bdver4 processor.
bonnell - Select the bonnell processor.
broadwell - Select the broadwell processor.
btver1 - Select the btver1 processor.
btver2 - Select the btver2 processor.
c3 - Select the c3 processor.
c3-2 - Select the c3-2 processor.
cannonlake - Select the cannonlake processor.
cascadelake - Select the cascadelake processor.
cooperlake - Select the cooperlake processor.
core-avx-i - Select the core-avx-i processor.
core-avx2 - Select the core-avx2 processor.
core2 - Select the core2 processor.
corei7 - Select the corei7 processor.
corei7-avx - Select the corei7-avx processor.
generic - Select the generic processor.
geode - Select the geode processor.
goldmont - Select the goldmont processor.
goldmont-plus - Select the goldmont-plus processor.
haswell - Select the haswell processor.
i386 - Select the i386 processor.
i486 - Select the i486 processor.
i586 - Select the i586 processor.
i686 - Select the i686 processor.
icelake-client - Select the icelake-client processor.
icelake-server - Select the icelake-server processor.
ivybridge - Select the ivybridge processor.
k6 - Select the k6 processor.
k6-2 - Select the k6-2 processor.
k6-3 - Select the k6-3 processor.
k8 - Select the k8 processor.
k8-sse3 - Select the k8-sse3 processor.
knl - Select the knl processor.
knm - Select the knm processor.
lakemont - Select the lakemont processor.
nehalem - Select the nehalem processor.
nocona - Select the nocona processor.
opteron - Select the opteron processor.
opteron-sse3 - Select the opteron-sse3 processor.
penryn - Select the penryn processor.
pentium - Select the pentium processor.
pentium-m - Select the pentium-m processor.
pentium-mmx - Select the pentium-mmx processor.
pentium2 - Select the pentium2 processor.
pentium3 - Select the pentium3 processor.
pentium3m - Select the pentium3m processor.
pentium4 - Select the pentium4 processor.
pentium4m - Select the pentium4m processor.
pentiumpro - Select the pentiumpro processor.
prescott - Select the prescott processor.
rocketlake - Select the rocketlake processor.
sandybridge - Select the sandybridge processor.
sapphirerapids - Select the sapphirerapids processor.
silvermont - Select the silvermont processor.
skx - Select the skx processor.
skylake - Select the skylake processor.
skylake-avx512 - Select the skylake-avx512 processor.
slm - Select the slm processor.
tigerlake - Select the tigerlake processor.
tremont - Select the tremont processor.
westmere - Select the westmere processor.
winchip-c6 - Select the winchip-c6 processor.
winchip2 - Select the winchip2 processor.
x86-64 - Select the x86-64 processor.
x86-64-v2 - Select the x86-64-v2 processor.
x86-64-v3 - Select the x86-64-v3 processor.
x86-64-v4 - Select the x86-64-v4 processor.
yonah - Select the yonah processor.
znver1 - Select the znver1 processor.
znver2 - Select the znver2 processor.
znver3 - Select the znver3 processor.
Available features for this target:
16bit-mode - 16-bit mode (i8086).
32bit-mode - 32-bit mode (80386).
3dnow - Enable 3DNow! instructions.
3dnowa - Enable 3DNow! Athlon instructions.
64bit - Support 64-bit instructions.
64bit-mode - 64-bit mode (x86_64).
adx - Support ADX instructions.
aes - Enable AES instructions.
amx-bf16 - Support AMX-BF16 instructions.
amx-int8 - Support AMX-INT8 instructions.
amx-tile - Support AMX-TILE instructions.
avx - Enable AVX instructions.
avx2 - Enable AVX2 instructions.
avx512bf16 - Support bfloat16 floating point.
avx512bitalg - Enable AVX-512 Bit Algorithms.
avx512bw - Enable AVX-512 Byte and Word Instructions.
avx512cd - Enable AVX-512 Conflict Detection Instructions.
avx512dq - Enable AVX-512 Doubleword and Quadword Instructions.
avx512er - Enable AVX-512 Exponential and Reciprocal Instructions.
avx512f - Enable AVX-512 instructions.
avx512fp16 - Support 16-bit floating point.
avx512ifma - Enable AVX-512 Integer Fused Multiple-Add.
avx512pf - Enable AVX-512 PreFetch Instructions.
avx512vbmi - Enable AVX-512 Vector Byte Manipulation Instructions.
avx512vbmi2 - Enable AVX-512 further Vector Byte Manipulation Instructions.
avx512vl - Enable AVX-512 Vector Length eXtensions.
avx512vnni - Enable AVX-512 Vector Neural Network Instructions.
avx512vp2intersect - Enable AVX-512 vp2intersect.
avx512vpopcntdq - Enable AVX-512 Population Count Instructions.
avxvnni - Support AVX_VNNI encoding.
bmi - Support BMI instructions.
bmi2 - Support BMI2 instructions.
branchfusion - CMP/TEST can be fused with conditional branches.
cldemote - Enable Cache Line Demote.
clflushopt - Flush A Cache Line Optimized.
clwb - Cache Line Write Back.
clzero - Enable Cache Line Zero.
cmov - Enable conditional move instructions.
crc32 - Enable SSE 4.2 CRC32 instruction (used when SSE4.2 is supported but function is GPR only).
cx16 - 64-bit with cmpxchg16b (this is true for most x86-64 chips, but not the first AMD chips).
cx8 - Support CMPXCHG8B instructions.
enqcmd - Has ENQCMD instructions.
ermsb - REP MOVS/STOS are fast.
f16c - Support 16-bit floating point conversion instructions.
false-deps-getmant - VGETMANTSS/SD/SH and VGETMANDPS/PD(memory version) has a false dependency on dest register.
false-deps-lzcnt-tzcnt - LZCNT/TZCNT have a false dependency on dest register.
false-deps-mulc - VF[C]MULCPH/SH has a false dependency on dest register.
false-deps-mullq - VPMULLQ has a false dependency on dest register.
false-deps-perm - VPERMD/Q/PS/PD has a false dependency on dest register.
false-deps-popcnt - POPCNT has a false dependency on dest register.
false-deps-range - VRANGEPD/PS/SD/SS has a false dependency on dest register.
fast-11bytenop - Target can quickly decode up to 11 byte NOPs.
fast-15bytenop - Target can quickly decode up to 15 byte NOPs.
fast-7bytenop - Target can quickly decode up to 7 byte NOPs.
fast-bextr - Indicates that the BEXTR instruction is implemented as a single uop with good throughput.
fast-gather - Indicates if gather is reasonably fast (this is true for Skylake client and all AVX-512 CPUs).
fast-hops - Prefer horizontal vector math instructions (haddp, phsub, etc.) over normal vector instructions with shuffles.
fast-lzcnt - LZCNT instructions are as fast as most simple integer ops.
fast-movbe - Prefer a movbe over a single-use load + bswap / single-use bswap + store.
fast-scalar-fsqrt - Scalar SQRT is fast (disable Newton-Raphson).
fast-scalar-shift-masks - Prefer a left/right scalar logical shift pair over a shift+and pair.
fast-shld-rotate - SHLD can be used as a faster rotate.
fast-variable-crosslane-shuffle - Cross-lane shuffles with variable masks are fast.
fast-variable-perlane-shuffle - Per-lane shuffles with variable masks are fast.
fast-vector-fsqrt - Vector SQRT is fast (disable Newton-Raphson).
fast-vector-shift-masks - Prefer a left/right vector logical shift pair over a shift+and pair.
fma - Enable three-operand fused multiple-add.
fma4 - Enable four-operand fused multiple-add.
fsgsbase - Support FS/GS Base instructions.
fsrm - REP MOVSB of short lengths is faster.
fxsr - Support fxsave/fxrestore instructions.
gfni - Enable Galois Field Arithmetic Instructions.
harden-sls-ijmp - Harden against straight line speculation across indirect JMP instructions..
harden-sls-ret - Harden against straight line speculation across RET instructions..
hreset - Has hreset instruction.
idivl-to-divb - Use 8-bit divide for positive values less than 256.
idivq-to-divl - Use 32-bit divide for positive values less than 2^32.
invpcid - Invalidate Process-Context Identifier.
kl - Support Key Locker kl Instructions.
lea-sp - Use LEA for adjusting the stack pointer (this is an optimization for Intel Atom processors).
lea-uses-ag - LEA instruction needs inputs at AG stage.
lvi-cfi - Prevent indirect calls/branches from using a memory operand, and precede all indirect calls/branches from a register with an LFENCE instruction to serialize control flow. Also decompose RET instructions into a POP+LFENCE+JMP sequence..
lvi-load-hardening - Insert LFENCE instructions to prevent data speculatively injected into loads from being used maliciously..
lwp - Enable LWP instructions.
lzcnt - Support LZCNT instruction.
macrofusion - Various instructions can be fused with conditional branches.
mmx - Enable MMX instructions.
movbe - Support MOVBE instruction.
movdir64b - Support movdir64b instruction (direct store 64 bytes).
movdiri - Support movdiri instruction (direct store integer).
mwaitx - Enable MONITORX/MWAITX timer functionality.
nopl - Enable NOPL instruction (generally pentium pro+).
pad-short-functions - Pad short functions (to prevent a stall when returning too early).
pclmul - Enable packed carry-less multiplication instructions.
pconfig - platform configuration instruction.
pku - Enable protection keys.
popcnt - Support POPCNT instruction.
prefer-128-bit - Prefer 128-bit AVX instructions.
prefer-256-bit - Prefer 256-bit AVX instructions.
prefer-mask-registers - Prefer AVX512 mask registers over PTEST/MOVMSK.
prefetchwt1 - Prefetch with Intent to Write and T1 Hint.
prfchw - Support PRFCHW instructions.
ptwrite - Support ptwrite instruction.
rdpid - Support RDPID instructions.
rdpru - Support RDPRU instructions.
rdrnd - Support RDRAND instruction.
rdseed - Support RDSEED instruction.
retpoline - Remove speculation of indirect branches from the generated code, either by avoiding them entirely or lowering them with a speculation blocking construct.
retpoline-external-thunk - When lowering an indirect call or branch using a `retpoline`, rely on the specified user provided thunk rather than emitting one ourselves. Only has effect when combined with some other retpoline feature.
retpoline-indirect-branches - Remove speculation of indirect branches from the generated code.
retpoline-indirect-calls - Remove speculation of indirect calls from the generated code.
rtm - Support RTM instructions.
sahf - Support LAHF and SAHF instructions in 64-bit mode.
sbb-dep-breaking - SBB with same register has no source dependency.
serialize - Has serialize instruction.
seses - Prevent speculative execution side channel timing attacks by inserting a speculation barrier before memory reads, memory writes, and conditional branches. Implies LVI Control Flow integrity..
sgx - Enable Software Guard Extensions.
sha - Enable SHA instructions.
shstk - Support CET Shadow-Stack instructions.
slow-3ops-lea - LEA instruction with 3 ops or certain registers is slow.
slow-incdec - INC and DEC instructions are slower than ADD and SUB.
slow-lea - LEA instruction with certain arguments is slow.
slow-pmaddwd - PMADDWD is slower than PMULLD.
slow-pmulld - PMULLD instruction is slow (compared to PMULLW/PMULHW and PMULUDQ).
slow-shld - SHLD instruction is slow.
slow-two-mem-ops - Two memory operand instructions are slow.
slow-unaligned-mem-16 - Slow unaligned 16-byte memory access.
slow-unaligned-mem-32 - Slow unaligned 32-byte memory access.
soft-float - Use software floating point features.
sse - Enable SSE instructions.
sse-unaligned-mem - Allow unaligned memory operands with SSE instructions (this may require setting a configuration bit in the processor).
sse2 - Enable SSE2 instructions.
sse3 - Enable SSE3 instructions.
sse4.1 - Enable SSE 4.1 instructions.
sse4.2 - Enable SSE 4.2 instructions.
sse4a - Support SSE 4a instructions.
ssse3 - Enable SSSE3 instructions.
tagged-globals - Use an instruction sequence for taking the address of a global that allows a memory tag in the upper address bits..
tbm - Enable TBM instructions.
tsxldtrk - Support TSXLDTRK instructions.
uintr - Has UINTR Instructions.
use-glm-div-sqrt-costs - Use Goldmont specific floating point div/sqrt costs.
use-slm-arith-costs - Use Silvermont specific arithmetic costs.
vaes - Promote selected AES instructions to AVX512/AVX registers.
vpclmulqdq - Enable vpclmulqdq instructions.
vzeroupper - Should insert vzeroupper instructions.
waitpkg - Wait and pause enhancements.
wbnoinvd - Write Back No Invalidate.
widekl - Support Key Locker wide Instructions.
x87 - Enable X87 float instructions.
xop - Enable XOP instructions.
xsave - Support xsave instructions.
xsavec - Support xsavec instructions.
xsaveopt - Support xsaveopt instructions.
xsaves - Support xsaves instructions.
Use +feature to enable a feature, or -feature to disable it.
For example, llc -mcpu=mycpu -mattr=+feature1,-feature2
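If you want to confirm which target string a running session actually received from -C, something like the following might help. It is only a sketch: Base.JLOptions() is internal to Julia rather than a documented stable API, so the cpu_target field is an assumption that may not hold across versions.

```julia
# Base.JLOptions() is internal; field names may change between Julia versions.
opts = Base.JLOptions()

# cpu_target is a C string pointer; guard against it being unset.
target = opts.cpu_target == C_NULL ? "(not set)" : unsafe_string(opts.cpu_target)

println("cpu_target: ", target)
println("CPU name:   ", Sys.CPU_NAME)  # CPU name as reported by Julia
```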
In any case, as expected, I do not see any improvement with -Cnative,-fast-gather compared to launching Julia without it. I will wait for LLVM 17 and rerun the benchmark then.