Neither Python (NumPy) nor C++ has rsqrt, but PyTorch and TensorFlow have it (I didn't confirm for Flux.jl or Lux.jl or look carefully, though CUDA.jl/Nvidia and the JavaScript stdlib have it for single and double precision, and AMD/HIP has it too), and CPUs have instructions for it. Zig has marked it as not planned (likely since they already optimize 1/sqrt well; see the Godbolt links there): rsqrt as builtin · Issue #19302 · ziglang/zig · GitHub. Rust has an open issue: Add rsqrt method to Float trait · Issue #1 · rust-num/num-traits · GitHub
1/sqrt(x) is common enough that you want it to be fast by default, and perhaps even to use the fastest approximate result. It's probably the only CPU (FPU) instruction I can think of that is not well supported in Julia.
At first, when looking into whether it's supported, it seemingly wasn't in Julia, i.e. code_native was verbose. It does seem supported, though, maybe in the best possible way for full accuracy (and bit-identical results across Intel and AMD), but not for speed.
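For reference, here is a quick way to probe what Julia actually emits (the anonymous function is just an ad-hoc test, nothing from a library):

# Dump native code for 1/sqrt(x); on x86-64 you see a full-precision
# square root (vsqrtss) followed by a division, rather than rsqrtss.
code_native(x -> 1 / sqrt(x), (Float32,); debuginfo=:none)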
Some background:
julia> sqrt(2)
1.4142135623730951
seems right, the only correct result (except that the true value is irrational, so infinitely long), but I can argue that 1.4 is accurate enough, or at least no more digits than 1.4142135f0 cast back to Float64 if wanted. It depends on the accuracy of the number 2: if it carries one significant digit, it stands for [1.5, 2.5]. sqrt(2.0) is still only 1.4, or even 1.396, and sqrt(2.00) is 1.41.
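To make the significant-digits argument concrete (a hypothetical illustration, not from any library):

# If "2.0" is only trusted to two significant digits, the true input
# lies somewhere in [1.95, 2.05], so only the leading 1.4 of the
# square root is meaningful:
sqrt(1.95)  # ≈ 1.3964
sqrt(2.05)  # ≈ 1.4318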
So I actually think sqrt of Float64 should return Float32, but that ship has sailed (at least for now). For rsqrt, however, the implied sqrt there can return Float32, while the final result seemingly needs to be Float64 because of inputs in [0, 1] (as done in the benchmark code below; actually only inputs in [0, 1f-46] overflow otherwise):
julia> Float32(Float64(1/sqrt(1f-46)))
Inf32
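A minimal sketch of that mixed-precision idea (an assumed definition, not an existing Julia function; the name rsqrt64 is mine):

# Take the square root in Float32 for speed, but do the final
# reciprocal in Float64 so the result itself can't overflow Float32:
rsqrt64(x) = 1.0 / Float64(@fastmath sqrt(Float32(x)))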
My conclusion is that it is not worth bothering with SSE2 unless we make calculations on no fewer than 4 variables. (Maybe this applies only to rsqrt here, but it is an expensive calculation (it also includes multiple multiplications), so it probably applies to other calculations too.)
Also sqrt(x) is faster than x*rsqrt(x) with two iterations, and x*rsqrt(x) with one iteration is too inaccurate for distance calculation.
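For context, the "iterations" mentioned there are Newton-Raphson refinements of the hardware estimate; a generic sketch (the name is mine):

# One Newton-Raphson refinement of y ≈ 1/sqrt(x); sqrt(x) can then be
# recovered as x * rsqrt(x), which is what the comparison above is about.
nr_step(x, y) = y * (1.5f0 - 0.5f0 * x * y * y)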
There is no function in the standard library that does this, but your compiler might optimize the expression 1 / sqrt(value) such that it does emit the RSQRTSS instruction.
[It doesn’t for Julia, even with fastmath.]
https://robert.ocallahan.org/2021/09/rr-trace-portability-diverging-behavior.html
Unfortunately, today I discovered a new difference between AMD and Intel: the RSQRTSS instruction. Perhaps this is unsurprising, since it is described as: “computes an approximate reciprocal of the square root of the low single-precision floating-point value in the source operand” (emphasis mine). A simple test program:
..
asm ("rsqrtss %1,%0" : "=x"(out) : "x"(in));
..
On Intel Skylake I get
out = 3d7ff000, float = 0.062485
On AMD Rome I get
out = 3d7ff800, float = 0.062492
[Neither value is correct, nor within [prevfloat, nextfloat] of the correctly computed 0.0625, unless 256 has only 3 or 4 significant decimal digits, i.e. a range for 256 extending to cover 256.95f0 would be enough.]
Intel's result just stays within the documented 1.5 × 2^-12 relative error bound. (Seems unfortunate given that the exact reciprocal square root of 256 is so easily computed to 0.0625, but whatever...)
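That test can be reproduced from Julia too; here is a hedged sketch using the LLVM SSE intrinsic directly (x86-only; the function name rsqrtss is mine):

# Call the raw SSE rsqrtss via llvmcall, mirroring the inline-asm test
# quoted above. The intrinsic operates on <4 x float>, so insert the
# scalar into lane 0 and extract it back out.
function rsqrtss(x::Float32)
    Base.llvmcall(
        ("declare <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float>)",
         """%v = insertelement <4 x float> undef, float %0, i32 0
            %r = call <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float> %v)
            %s = extractelement <4 x float> %r, i32 0
            ret float %s"""),
        Float32, Tuple{Float32}, x)
end

reinterpret(UInt32, rsqrtss(256f0))        # 0x3d7ff000 (Intel) or 0x3d7ff800 (AMD), per the quote
abs(rsqrtss(256f0) - 0.0625f0) / 0.0625f0  # both stay under 1.5 × 2^-12 ≈ 3.66e-4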
https://www.reddit.com/r/programming/comments/12tys7d/revisiting_the_fast_inverse_square_root_is_it/
# When called with @fastmath, 1 / sqrt(x::Float32) will use
# SSE single-precision rsqrt approximation followed by a single
# iteration of the Newton-Raphson method. This, followed by an
# additional double-precision Newton-Raphson iteration gives
# sufficient precision for this problem and is significantly
# faster than double-precision division and sqrt.
rd = @ntuple 10 k-> @fastmath Float64(1 / sqrt(Float32(dsq[k])))
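(The @code_lowered output below matches a definition along these lines, reconstructed from the output itself; note it references Main.x, suggesting the original version accidentally closed over a global x:)

rsqrt(x) = @fastmath Float64(1 / sqrt(Float32(x)))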
julia> @code_lowered rsqrt(10.0f0)
CodeInfo(
1 ─ %1 = Base.FastMath
│ %2 = Base.getproperty(%1, :div_fast)
│ %3 = Base.FastMath
│ %4 = Base.getproperty(%3, :sqrt_fast)
│ %5 = Main.Float32(Main.x)
│ %6 = (%4)(%5)
│ %7 = (%2)(1, %6)
│ %8 = Main.Float64(%7)
└── return %8
)
One use of rsqrt is in Diff Attention for the DIFF Transformer (see my Julia code translated from the paper there):
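(As a generic illustration of where rsqrt shows up in attention-style code, not the DIFF Transformer code referenced above: the usual 1/sqrt(d) score scaling, one rsqrt per call instead of a division per element.)

# Generic scaled dot-product attention scores; Q and K are n×d and m×d.
scores(Q, K) = (Q * K') * inv(sqrt(Float32(size(Q, 2))))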
[Offtopic, but also curious: does Julia ever emit the "FPATAN — Partial Arctangent" instruction? Seemingly not, though I couldn't confirm; maybe it's behind the call rax I see in the assembly. Is there a good way to know, and/or to get the fully inlined assembly? x86 FPATAN instruction · Issue #19330 · ziglang/zig · GitHub]
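One way to probe that (a generic check, nothing authoritative): dump the native code for atan itself rather than a wrapper, so there is no call for it to hide behind:

# Julia's Float64 atan is implemented in Julia (ported from FDLIBM),
# so you would expect ordinary SSE arithmetic here, not the x87 fpatan.
code_native(atan, (Float64,); debuginfo=:none)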