How to prevent unwanted "optimization" in SIMD code?

I’m writing a SIMD base64 encoding algorithm (from this paper), and I’ve run into an annoying issue. Consider the following example method:

function test_mul(a::NTuple{16, VecElement{UInt16}}, b::NTuple{16, VecElement{UInt16}})
  @. VecElement{UInt16}(getfield(a, :value) * getfield(b, :value))
end

Now, with code_llvm I get the mul I’d expect.

julia> code_llvm(test_mul; debuginfo=:none)
; Function Signature: test_mul(NTuple{16, VecElement{UInt16}}, NTuple{16, VecElement{UInt16}})
define <16 x i16> @julia_test_mul_5627(<16 x i16> %"a::Tuple", <16 x i16> %"b::Tuple") #0 {
top:
  %0 = mul <16 x i16> %"b::Tuple", %"a::Tuple"
  ret <16 x i16> %0
}

But here’s the annoying part. If I hardcode one of the values, I get this:

const mask = ntuple(n->VecElement(n%2==1 ? 0x0100 : 0x0010), 16)
test_mul_2(a::NTuple{16, VecElement{UInt16}}) = test_mul(a, mask)
julia> code_llvm(test_mul_2; debuginfo=:none)
; Function Signature: test_mul_2(NTuple{16, VecElement{UInt16}})
define <16 x i16> @julia_test_mul_2_5883(<16 x i16> %"a::Tuple") #0 {
top:
  %0 = shl <16 x i16> %"a::Tuple", <i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8>
  ret <16 x i16> %0
}

This left shift may seem like a sensible change, but now there’s a problem. If we compare the native code, there are a bunch of extra instructions:

julia> code_native(test_mul; debuginfo=:none, dump_module=false)
        .text
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        vpmullw ymm0, ymm1, ymm0
        pop     rbp
        ret
        nop     word ptr cs:[rax + rax]

julia> code_native(test_mul_2; debuginfo=:none, dump_module=false)
        .text
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        vpsllw  ymm1, ymm0, 8
        vpsllw  ymm0, ymm0, 4
        vpblendw        ymm0, ymm0, ymm1, 170           # ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]
        pop     rbp
        ret
        nop     word ptr cs:[rax + rax]
        nop     dword ptr [rax]
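To see the cost of those extra instructions in isolation, here is a rough micro-benchmark sketch (assumes BenchmarkTools.jl is installed; blocks and first_lane are made-up helpers, and the input data is arbitrary):

using BenchmarkTools

# Arbitrary test data: 10_000 vectors of 16 random UInt16 lanes.
blocks = [ntuple(_ -> VecElement(rand(UInt16)), 16) for _ in 1:10_000]

# Cheap reduction so the kernel result is actually used and the loop isn't optimized away.
first_lane(v::NTuple{16, VecElement{UInt16}}) = v[1].value

@btime sum(b -> first_lane(test_mul(b, mask)), $blocks)   # vpmullw version
@btime sum(b -> first_lane(test_mul_2(b)), $blocks)       # vpsllw + vpsllw + vpblendw version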

Since the base64 algorithm includes this multiply and runs billions of times, it causes a noticeable performance drop. Is there any way for me to force it to compile to the shorter version while still using a constant? I’ve tried @nospecialize, but it doesn’t seem to work since the method gets inlined. I’ve also tried using llvmcall, and the change still happens. Thank you in advance!


Interestingly, I’m seeing much less stupid codegen for test_mul_2 (on both Julia 1.11 and 1.13):

julia> code_native(test_mul_2; debuginfo=:none, dump_module=false)
	push	rbp
	mov	rbp, rsp
	mov	rax, qword ptr [r13 + 16]
	mov	rax, qword ptr [rax + 16]
	mov	rax, qword ptr [rax]
	movabs	rax, offset .rodata.cst32
	vpsllvw	ymm0, ymm0, ymmword ptr [rax]
	pop	rbp
	ret
	nop	word ptr cs:[rax + rax]

What CPU and Julia version is this?

Ah, looks like this generates reasonable code with AVX-512, but not without (because vpsllvw is an AVX-512 instruction). Specifically, see https://stackoverflow.com/a/66018798/5141328. Basically Intel just forgot one of the obviously useful shift instructions in AVX/AVX2 and only patched their work a couple years later (and then decided in 2020 to stop supporting AVX-512 because Intel is a horrible mess of a company with a ridiculous amount of infighting).
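For anyone else wanting to check what their machine and Julia build support, a quick sketch: versioninfo() prints the Julia version and CPU model, while Sys.CPU_NAME gives the LLVM target name that code_native compiles for.

using InteractiveUtils   # for versioninfo (already loaded in the REPL)

versioninfo()
Sys.CPU_NAME   # AVX2-only targets like "haswell" or "znver2" lack vpsllvw; "skylake-avx512" has it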


Thanks for the help! I see that vpsllvw is not available on my computer, but I do have vpmullw. If I could get the LLVM code to use mul instead of shl, it seems like I could still get efficient code. I don’t know how to prevent the change from happening, though. Even using llvmcall, it still gets rewritten:

function test_mul_3(a::NTuple{16, VecElement{UInt16}})
  Base.llvmcall("""
  %out = mul <16 x i16> %0, <i16 16, i16 256, i16 16, i16 256, i16 16, i16 256, i16 16, i16 256, i16 16, i16 256, i16 16, i16 256, i16 16, i16 256, i16 16, i16 256>
  ret <16 x i16> %out
  """, NTuple{16, VecElement{UInt16}}, Tuple{NTuple{16, VecElement{UInt16}}}, a)
end
julia> code_llvm(test_mul_3; debuginfo=:none)
; Function Signature: test_mul_3(NTuple{16, VecElement{UInt16}})
define <16 x i16> @julia_test_mul_3_2889(<16 x i16> %"a::Tuple") #0 {
top:
  %out.i = shl <16 x i16> %"a::Tuple", <i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8>
  ret <16 x i16> %out.i
}

julia> code_native(test_mul_3; debuginfo=:none, dump_module=false)
        .text
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        vpsllw  ymm1, ymm0, 8
        vpsllw  ymm0, ymm0, 4
        vpblendw        ymm0, ymm0, ymm1, 170           # ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]
        pop     rbp
        ret
        nop     word ptr cs:[rax + rax]
        nop     dword ptr [rax]

Can you think of any way to force the LLVM code to use mul? And do you know whether this change is being done by Julia or by LLVM?

Ideally the x86 codegen should choose to compute this shift as a vpmullw when vpsllvw isn’t available. The right answer here is probably an LLVM PR: the canonicalization LLVM is doing is correct, it’s just the lowering that needs to be smarter.

You should be able to just emit the assembly you’d like via LLVM inline asm; I would think LLVM would then be less likely to ‘optimize’ it.
E.g., to make a syscall, LinuxPerf.jl does:

res = Base.llvmcall("""%val = call i64 asm sideeffect "syscall", "={rax},{rax},{rdi},~{rcx},~{r11},~{memory}"(i64 %0, i64 %1)
                       ret i64 %val""", Int64, Tuple{Int64, Int64}, SYS_prctl, Int64(op))
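Here’s roughly what that could look like for the multiply in question. This is an untested sketch: test_mul_asm and test_mul_asm_with_mask are made-up names, AVX2 is assumed, the operand order is AT&T (destination last), and the \$ in the asm template are escaped because Julia would otherwise interpolate $ inside the string.

# Untested sketch: force a vpmullw through inline asm so LLVM can't rewrite it into shifts.
# Constraint string "=x,x,x": $0 is the output register, $1 and $2 are the inputs.
function test_mul_asm(a::NTuple{16, VecElement{UInt16}}, b::NTuple{16, VecElement{UInt16}})
  Base.llvmcall("""
  %out = call <16 x i16> asm "vpmullw \$2, \$1, \$0", "=x,x,x"(<16 x i16> %0, <16 x i16> %1)
  ret <16 x i16> %out
  """, NTuple{16, VecElement{UInt16}},
       Tuple{NTuple{16, VecElement{UInt16}}, NTuple{16, VecElement{UInt16}}}, a, b)
end

test_mul_asm_with_mask(a::NTuple{16, VecElement{UInt16}}) = test_mul_asm(a, mask)

One caveat: the asm blob is opaque to LLVM, so while it can’t be “optimized” into the shift, it also won’t participate in any other optimizations around it.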

LLVM issue created: x86 missing optimization for variable shift left (without avx512) · Issue #140418 · llvm/llvm-project · GitHub


Thank you so much! I don’t think I have enough knowledge of LLVM to make the issue myself, so I’m glad you did it.

I’ll look into this more tomorrow, thank you. I was also able to get it working in a hacky way by putting @noinline on the top-level method and passing the constant value all the way down, but your solution looks better.
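For reference, a rough sketch of what that @noinline hack could look like (hypothetical names; the idea is that the mask arrives as a runtime argument through a call that’s never inlined, so constant propagation never reaches the multiply):

# Hypothetical sketch of the function-barrier workaround: because encode_chunk is
# @noinline, LLVM only ever sees `m` as a runtime value inside it, so the `mul`
# survives as a vpmullw. The extra call per chunk is the price of the hack.
@noinline function encode_chunk(a::NTuple{16, VecElement{UInt16}},
                                m::NTuple{16, VecElement{UInt16}})
  test_mul(a, m)
end

encode_chunk(a::NTuple{16, VecElement{UInt16}}) = encode_chunk(a, mask)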