I’m trying to set up a SIMD intrinsic for rsqrt:
function rsqrt(f::Vec{4, Float32})
llvmcall(("declare <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float>) nounwind readnone", "
%2 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %0)
ret <4 x float> %2"), Vec{4, Float32}, (Vec{4, Float32},), f)
end
I’m fairly sure this has the correct mapping from Julia’s Vec{4, Float32} to LLVM’s <4 x float> vector type, but I’m still seeing:
error compiling rsqrt: error statically evaluating llvmcall argument tuple
What would be the best way to debug the LLVM error? Godbolt seems to think the LLVM instructions are fine: Compiler Explorer
function rsqrt(f::Vec{4, Float32})
v = ccall("llvm.x86.sse.rsqrt.ps", llvmcall, NTuple{4, VecElement{Float32}}, (NTuple{4, VecElement{Float32}},), f.elts);
Vec{4, Float32}(v)
end
might be a bit easier than text llvmcall:
julia> v = Vec{4, Float32}((1.0, 2.0, 3.0, 4.0))
<4 x Float32>[1.0, 2.0, 3.0, 4.0]
julia> rsqrt(v)
<4 x Float32>[0.99975586, 0.7069092, 0.5772705, 0.49987793]
The point is that you cannot use Vec directly for the argument or the argument types. It needs to be the native type NTuple{4, VecElement{Float32}}.
SIMD and SIMD-intrinsics in Julia | Kristoffer Carlsson might have something useful.
Thank you! I agree that ccall is probably better for a single intrinsic. For more context, I’m actually trying to write:
_mm_cvtps_pd(_mm_rsqrt_ps(_mm_cvtpd_ps(m128d)))
Can this work with nested ccall, or does the fact that we’re converting a pair of doubles to a pair of floats (packed into contiguous 4-byte lanes) and back make the types tricky?
Also, I didn’t have any luck substituting out Vec; I’m still getting the same error with this:
function rsqrt(f::NTuple{4, VecElement{Float32}})
llvmcall(("declare <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float>) nounwind readnone", "
%2 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %0)
ret <4 x float> %2"),
NTuple{4, VecElement{Float32}},
(NTuple{4, VecElement{Float32}},), f)
end
It needs to be Tuple{NTuple{4, VecElement{Float32}},} for the input argument type. Things are a bit inconsistent about when you can use a tuple and when you need a tuple type.
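To make the tuple versus tuple type distinction concrete, here is a minimal sketch. It is not the rsqrt wrapper itself: it uses a plain fadd so that it needs no declare line and is not tied to x86. The names V4F32 and vadd are made up for illustration.

```julia
# Hypothetical alias for the native 4-lane Float32 vector type.
const V4F32 = NTuple{4, VecElement{Float32}}

# (V4F32,) is a tuple *value* that happens to contain a type.
# Tuple{V4F32} is a tuple *type*, which is what llvmcall expects here.
@assert (V4F32,) isa Tuple
@assert Tuple{V4F32} isa DataType

# Illustrative wrapper: element-wise add of two <4 x float> vectors.
# Arguments are %0 and %1, the entry block takes %2, so the first
# unnamed result is %3.
function vadd(a::V4F32, b::V4F32)
    Base.llvmcall("""
        %3 = fadd <4 x float> %0, %1
        ret <4 x float> %3
        """, V4F32, Tuple{V4F32, V4F32}, a, b)
end

a = ntuple(i -> VecElement(Float32(i)), 4)
b = ntuple(_ -> VecElement(10f0), 4)
r = vadd(a, b)
println(map(x -> x.value, r))
```

The same pattern should carry over to the rsqrt case: keep NTuple{4, VecElement{Float32}} as the return type and wrap the argument types in Tuple{...} rather than passing them as a tuple value.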
Perfect, that’s exactly what I missed.
For anyone else trying this: passing a tuple value as the final argument-types parameter seems to work with other argument types, but not with a vector of floats.