I’m working on a SIMD base64 encoding algorithm (from this paper), and I’ve run into an annoying issue. Consider the following example method:
function test_mul(a::NTuple{16, VecElement{UInt16}}, b::NTuple{16, VecElement{UInt16}})
    @. VecElement{UInt16}(getfield(a, :value) * getfield(b, :value))
end
Now, with `code_llvm` I get the `mul` I’d expect:
julia> code_llvm(test_mul; debuginfo=:none)
; Function Signature: test_mul(NTuple{16, VecElement{UInt16}}, NTuple{16, VecElement{UInt16}})
define <16 x i16> @julia_test_mul_5627(<16 x i16> %"a::Tuple", <16 x i16> %"b::Tuple") #0 {
top:
%0 = mul <16 x i16> %"b::Tuple", %"a::Tuple"
ret <16 x i16> %0
}
But here’s the annoying part. If I hardcode one of the values, I get this:
const mask = ntuple(n->VecElement(n%2==1 ? 0x0100 : 0x0010), 16)
test_mul_2(a::NTuple{16, VecElement{UInt16}}) = test_mul(a, mask)
julia> code_llvm(test_mul_2; debuginfo=:none)
; Function Signature: test_mul_2(NTuple{16, VecElement{UInt16}})
define <16 x i16> @julia_test_mul_2_5883(<16 x i16> %"a::Tuple") #0 {
top:
%0 = shl <16 x i16> %"a::Tuple", <i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8>
ret <16 x i16> %0
}
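For context, I believe this happens because both mask lanes are powers of two (`0x0100 == 1 << 8`, `0x0010 == 1 << 4`), so LLVM strength-reduces the multiply into a per-lane shift. A quick scalar sanity check of that equivalence:

```julia
# Each mask lane is a power of two, so multiplying by it is the same as a
# left shift (both use wrapping UInt16 arithmetic):
@assert 0x0100 == UInt16(1) << 8    # odd lanes: multiply by 256 == shl 8
@assert 0x0010 == UInt16(1) << 4    # even lanes: multiply by 16 == shl 4

x = 0x0a2b
@assert x * 0x0100 === x << 8       # both give 0x2b00
@assert x * 0x0010 === x << 4       # both give 0xa2b0
```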
This left shift may seem like a sensible change, but there’s a problem now. If we compare the native code, there are a bunch of extra instructions:
julia> code_native(test_mul; debuginfo=:none, dump_module=false)
.text
push rbp
mov rbp, rsp
mov rax, qword ptr [r13 + 16]
mov rax, qword ptr [rax + 16]
mov rax, qword ptr [rax]
vpmullw ymm0, ymm1, ymm0
pop rbp
ret
nop word ptr cs:[rax + rax]
julia> code_native(test_mul_2; debuginfo=:none, dump_module=false)
.text
push rbp
mov rbp, rsp
mov rax, qword ptr [r13 + 16]
mov rax, qword ptr [rax + 16]
mov rax, qword ptr [rax]
vpsllw ymm1, ymm0, 8
vpsllw ymm0, ymm0, 4
vpblendw ymm0, ymm0, ymm1, 170 # ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]
pop rbp
ret
nop word ptr cs:[rax + rax]
nop dword ptr [rax]
Since the base64 algorithm includes this operation and runs billions of times, this causes a noticeable performance drop. Is there any way to force it to compile to the shorter version while still using a constant? I’ve tried `@nospecialize`, but it doesn’t seem to work since the method gets inlined. I’ve also tried using `llvmcall`, and the change still happens. Thank you in advance!
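Edit: for reference, here’s roughly what my `llvmcall` attempt looked like (a sketch; the name `test_mul_llvm` is mine). Even with the `mul` written by hand, the constant operand still gets strength-reduced into shifts once LLVM sees it:

```julia
# Hand-written vector multiply via llvmcall; with two arguments (%0, %1),
# the first unnamed SSA temporary in the body is %3.
using Base: llvmcall

function test_mul_llvm(a::NTuple{16, VecElement{UInt16}},
                       b::NTuple{16, VecElement{UInt16}})
    llvmcall("""
        %3 = mul <16 x i16> %1, %0
        ret <16 x i16> %3
        """,
        NTuple{16, VecElement{UInt16}},
        Tuple{NTuple{16, VecElement{UInt16}}, NTuple{16, VecElement{UInt16}}},
        a, b)
end
```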