Debugging float SIMD Intrinsics via llvmcall

I’m trying to set up a SIMD intrinsic for rsqrt:

function rsqrt(f::Vec{4, Float32})
    llvmcall(("declare <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float>) nounwind readnone", " 
              %2 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %0)
              ret <4 x float> %2"), Vec{4, Float32}, (Vec{4, Float32},), f)
end

I’m fairly sure this has the correct mapping from Julia’s Vec{4, Float32} to LLVM’s <4 x float> vector type, but I’m still seeing:

error compiling rsqrt: error statically evaluating llvmcall argument tuple

What would be the best way to debug the LLVM error? Godbolt seems to think the LLVM instructions are fine: Compiler Explorer

function rsqrt(f::Vec{4, Float32})
    v = ccall("llvm.x86.sse.rsqrt.ps", llvmcall, NTuple{4, VecElement{Float32}}, (NTuple{4, VecElement{Float32}},), f.elts);
    Vec{4, Float32}(v)
end

might be a bit easier than text llvmcall:

julia> v = Vec{4, Float32}((1.0, 2.0, 3.0, 4.0))
<4 x Float32>[1.0, 2.0, 3.0, 4.0]
julia> rsqrt(v)
<4 x Float32>[0.99975586, 0.7069092, 0.5772705, 0.49987793]

The point is that you cannot use Vec directly for the argument value or for the return and argument types. They need to use the native type NTuple{4, VecElement{Float32}}.

The blog post "SIMD and SIMD-intrinsics in Julia" by Kristoffer Carlsson might have something useful.

Thank you! I agree that ccall is probably better for a single intrinsic. For more context, I’m actually trying to write:

_mm_cvtps_pd(_mm_rsqrt_ps(_mm_cvtpd_ps(m128d)))

Can this work with nested ccall, or does the fact that we’re converting a pair of doubles to a pair of contiguous 4-byte floats and back make the types tricky?

Also, I didn’t have any luck substituting out Vec, still getting the same error with this:

function rsqrt(f::NTuple{4, VecElement{Float32}})
    llvmcall(("declare <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float>) nounwind readnone", " 
              %2 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %0)
              ret <4 x float> %2"),
        NTuple{4, VecElement{Float32}},
        (NTuple{4, VecElement{Float32}},), f)
end

The input argument types need to be given as Tuple{NTuple{4, VecElement{Float32}},}, a tuple type rather than a tuple of types. Things are a bit inconsistent about when you can use a tuple and when you need a tuple type.
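
Applied to the function above, that change would look roughly like the sketch below. It keeps the same two IR strings as the question and only swaps the argument-types value. It assumes an x86 machine and a Julia version that accepts the (declaration, body) pair form of llvmcall; other versions may expect the IR in a different shape. The `using Base: llvmcall` line is added here only so the snippet stands on its own.

```julia
using Base: llvmcall  # llvmcall is not exported from Base

# Same IR strings as in the question; only the argument-types value changes.
function rsqrt(f::NTuple{4, VecElement{Float32}})
    llvmcall(("declare <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float>) nounwind readnone", "
              %2 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %0)
              ret <4 x float> %2"),
        NTuple{4, VecElement{Float32}},           # return type
        Tuple{NTuple{4, VecElement{Float32}}},    # argument types, given as a Tuple type
        f)
end
```

If you are holding a Vec, pass its underlying tuple (f.elts in the ccall version above) and wrap the result back into a Vec afterwards.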

Perfect, that’s exactly what I missed.

For anyone else trying this: passing a plain tuple as the final argument-types value seems to work with other argument types, but not with a vector of floats.
