How to prevent unwanted "optimization" in SIMD code?

I’m writing a SIMD base64 encoding algorithm (from this paper), and I’ve run into an annoying issue. Consider the following example method:

function test_mul(a::NTuple{16, VecElement{UInt16}}, b::NTuple{16, VecElement{UInt16}})
  @. VecElement{UInt16}(getfield(a, :value) * getfield(b, :value))
end

Now, with code_llvm I get the mul I’d expect.

julia> code_llvm(test_mul; debuginfo=:none)
; Function Signature: test_mul(NTuple{16, VecElement{UInt16}}, NTuple{16, VecElement{UInt16}})
define <16 x i16> @julia_test_mul_5627(<16 x i16> %"a::Tuple", <16 x i16> %"b::Tuple") #0 {
top:
  %0 = mul <16 x i16> %"b::Tuple", %"a::Tuple"
  ret <16 x i16> %0
}

But here’s the annoying part. If I hardcode one of the values, I get this:

const mask = ntuple(n->VecElement(n%2==1 ? 0x0100 : 0x0010), 16)
test_mul_2(a::NTuple{16, VecElement{UInt16}}) = test_mul(a, mask)
julia> code_llvm(test_mul_2; debuginfo=:none)
; Function Signature: test_mul_2(NTuple{16, VecElement{UInt16}})
define <16 x i16> @julia_test_mul_2_5883(<16 x i16> %"a::Tuple") #0 {
top:
  %0 = shl <16 x i16> %"a::Tuple", <i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8>
  ret <16 x i16> %0
}

This left shift may seem like a sensible change, but now there’s a problem. If we compare the native code, there are a bunch of extra instructions:

julia> code_native(test_mul; debuginfo=:none, dump_module=false)
        .text
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        vpmullw ymm0, ymm1, ymm0
        pop     rbp
        ret
        nop     word ptr cs:[rax + rax]

julia> code_native(test_mul_2; debuginfo=:none, dump_module=false)
        .text
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        vpsllw  ymm1, ymm0, 8
        vpsllw  ymm0, ymm0, 4
        vpblendw        ymm0, ymm0, ymm1, 170           # ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]
        pop     rbp
        ret
        nop     word ptr cs:[rax + rax]
        nop     dword ptr [rax]
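To see the cost of those extra instructions in isolation, here is a rough micro-benchmark sketch (assumes BenchmarkTools.jl is installed; blocks and first_lane are made-up helpers, and the input data is arbitrary):

using BenchmarkTools

# Arbitrary test data: 10_000 vectors of 16 random UInt16 lanes.
blocks = [ntuple(_ -> VecElement(rand(UInt16)), 16) for _ in 1:10_000]

# Cheap reduction so the kernel result is actually used and the loop isn't optimized away.
first_lane(v::NTuple{16, VecElement{UInt16}}) = v[1].value

@btime sum(b -> first_lane(test_mul(b, mask)), $blocks)   # vpmullw version
@btime sum(b -> first_lane(test_mul_2(b)), $blocks)       # vpsllw + vpsllw + vpblendw version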

Since the base64 algorithm includes this multiply and runs billions of times, it causes a noticeable performance drop. Is there any way for me to force it to compile to the shorter version while still using a constant? I’ve tried @nospecialize, but it doesn’t seem to work since the method gets inlined. I’ve also tried using llvmcall, and the change still happens. Thank you in advance!


Interestingly, I’m seeing much less stupid codegen for test_mul_2 (on both Julia 1.11 and 1.13):

julia> code_native(test_mul_2; debuginfo=:none, dump_module=false)
	push	rbp
	mov	rbp, rsp
	mov	rax, qword ptr [r13 + 16]
	mov	rax, qword ptr [rax + 16]
	mov	rax, qword ptr [rax]
	movabs	rax, offset .rodata.cst32
	vpsllvw	ymm0, ymm0, ymmword ptr [rax]
	pop	rbp
	ret
	nop	word ptr cs:[rax + rax]

What CPU and Julia version is this?

Ah, looks like this generates reasonable code with AVX-512, but not without (because vpsllvw is an AVX-512 instruction). Specifically, see https://stackoverflow.com/a/66018798/5141328. Basically Intel just forgot one of the obviously useful shift instructions in AVX/AVX2 and only patched their work a couple years later (and then decided in 2020 to stop supporting AVX-512 because Intel is a horrible mess of a company with a ridiculous amount of infighting).
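For anyone else wanting to check what their machine and Julia build support, a quick sketch: versioninfo() prints the Julia version and CPU model, while Sys.CPU_NAME gives the LLVM target name that code_native compiles for.

using InteractiveUtils   # for versioninfo (already loaded in the REPL)

versioninfo()
Sys.CPU_NAME   # AVX2-only targets like "haswell" or "znver2" lack vpsllvw; "skylake-avx512" has it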


Thanks for the help! I see that vpsllvw is not available on my computer, but I do have vpmullw. If I could get the LLVM code to use mul instead of shl, it seems like I could still get efficient code. I don’t know how to prevent the change from happening, though. Even using llvmcall, it still gets rewritten:

function test_mul_3(a::NTuple{16, VecElement{UInt16}})
  Base.llvmcall("""
  %out = mul <16 x i16> %0, <i16 16, i16 256, i16 16, i16 256, i16 16, i16 256, i16 16, i16 256, i16 16, i16 256, i16 16, i16 256, i16 16, i16 256, i16 16, i16 256>
  ret <16 x i16> %out
  """, NTuple{16, VecElement{UInt16}}, Tuple{NTuple{16, VecElement{UInt16}}}, a)
end
julia> code_llvm(test_mul_3; debuginfo=:none)
; Function Signature: test_mul_3(NTuple{16, VecElement{UInt16}})
define <16 x i16> @julia_test_mul_3_2889(<16 x i16> %"a::Tuple") #0 {
top:
  %out.i = shl <16 x i16> %"a::Tuple", <i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8>
  ret <16 x i16> %out.i
}

julia> code_native(test_mul_3; debuginfo=:none, dump_module=false)
        .text
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        vpsllw  ymm1, ymm0, 8
        vpsllw  ymm0, ymm0, 4
        vpblendw        ymm0, ymm0, ymm1, 170           # ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]
        pop     rbp
        ret
        nop     word ptr cs:[rax + rax]
        nop     dword ptr [rax]

Can you think of any way to force the LLVM code to use mul? And do you know whether this change is being done by Julia or by LLVM?

Ideally the x86 codegen should choose to compute this shift as a vpmullw when vpsllvw isn’t available. The right answer here is probably an LLVM PR: the canonicalization LLVM is doing is correct, it’s just the lowering that needs to be smarter.

You should be able to just emit the assembly you’d like via LLVM inline asm; I would think LLVM would then be less likely to ‘optimize’ it.
E.g., to make a syscall, LinuxPerf.jl does:

res = Base.llvmcall("""%val = call i64 asm sideeffect "syscall", "={rax},{rax},{rdi},~{rcx},~{r11},~{memory}"(i64 %0, i64 %1)
                       ret i64 %val""", Int64, Tuple{Int64, Int64}, SYS_prctl, Int64(op))
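Here’s roughly what that could look like for the multiply in question. This is an untested sketch: test_mul_asm and test_mul_asm_with_mask are made-up names, AVX2 is assumed, the operand order is AT&T (destination last), and the \$ in the asm template are escaped because Julia would otherwise interpolate $ inside the string.

# Untested sketch: force a vpmullw through inline asm so LLVM can't rewrite it into shifts.
# Constraint string "=x,x,x": $0 is the output register, $1 and $2 are the inputs.
function test_mul_asm(a::NTuple{16, VecElement{UInt16}}, b::NTuple{16, VecElement{UInt16}})
  Base.llvmcall("""
  %out = call <16 x i16> asm "vpmullw \$2, \$1, \$0", "=x,x,x"(<16 x i16> %0, <16 x i16> %1)
  ret <16 x i16> %out
  """, NTuple{16, VecElement{UInt16}},
       Tuple{NTuple{16, VecElement{UInt16}}, NTuple{16, VecElement{UInt16}}}, a, b)
end

test_mul_asm_with_mask(a::NTuple{16, VecElement{UInt16}}) = test_mul_asm(a, mask)

One caveat: the asm blob is opaque to LLVM, so while it can’t be “optimized” into the shift, it also won’t participate in any other optimizations around it.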

LLVM issue created: x86 missing optimization for variable shift left (without avx512) · Issue #140418 · llvm/llvm-project · GitHub


Thank you so much! I don’t think I have enough knowledge of LLVM to make the issue myself, so I’m glad you did it.

I’ll look into this more tomorrow, thank you. I was also able to get it working in a hacky way by putting @noinline on the top-level method and passing the constant value all the way down, but your solution looks better.
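For reference, a rough sketch of what that @noinline hack could look like (hypothetical names; the idea is that the mask arrives as a runtime argument through a call that’s never inlined, so constant propagation never reaches the multiply):

# Hypothetical sketch of the function-barrier workaround: because encode_chunk is
# @noinline, LLVM only ever sees `m` as a runtime value inside it, so the `mul`
# survives as a vpmullw. The extra call per chunk is the price of the hack.
@noinline function encode_chunk(a::NTuple{16, VecElement{UInt16}},
                                m::NTuple{16, VecElement{UInt16}})
  test_mul(a, m)
end

encode_chunk(a::NTuple{16, VecElement{UInt16}}) = encode_chunk(a, mask)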