Should Julia have rsqrt? And e.g. support RSQRTSS

Neither Python (NumPy) nor C++ has rsqrt, but PyTorch and TensorFlow have it (I didn’t confirm in Flux.jl or Lux.jl or look carefully, though CUDA.jl/Nvidia and the JavaScript stdlib have it for single and double precision, and AMD/HIP has it too), and CPUs have instructions for it. Zig has declined to add it (likely since they already optimize well enough; see the Godbolt links there): rsqrt as builtin · Issue #19302 · ziglang/zig · GitHub. Rust has an open issue: Add rsqrt method to Float trait · Issue #1 · rust-num/num-traits · GitHub

1/sqrt(x) is common enough that you want it to be fast by default, and maybe even to use the fastest approximate result. It’s probably the only CPU (FPU) instruction I can think of that isn’t well supported in Julia.

At first glance it seemingly wasn’t supported in Julia, i.e. code_native was verbose, though it does seem supported, maybe in the best possible way for full accuracy (and bit-identical results across Intel and AMD), but not for speed.
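One way to check (a sketch; the output is target-dependent) is to inspect the native code for a tiny wrapper:

    f(x::Float32) = @fastmath 1f0 / sqrt(x)

    @code_native debuginfo=:none f(1f0)
    # On x86-64 this shows a square root plus divide (sqrtss/divss or their
    # AVX forms), not rsqrtss, even under @fastmath; see the note further down.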

Some background:

julia> sqrt(2)
1.4142135623730951

seems right, the only correct result (except that the exact value is irrational, of infinite length), but I can argue 1.4 is accurate enough, or at least nothing longer than 1.4142135f0 cast back to Float64 if wanted. It depends on the accuracy of the number 2: with one significant digit it means anything in [1.5, 2.5]. sqrt(2.0) is then still only 1.4, or even 1.396, and sqrt(2.00) is 1.41.
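To make the significant-figures reading concrete (assuming “2” means anywhere in [1.5, 2.5] and “2.0” anywhere in [1.95, 2.05]):

    sqrt(1.5)   # ≈ 1.2247
    sqrt(2.5)   # ≈ 1.5811
    sqrt(1.95)  # ≈ 1.3964  (the 1.396 above)
    sqrt(2.05)  # ≈ 1.4318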

So I actually think sqrt of a Float64 should return a Float32, but that ship has sailed (at least for now). For rsqrt, though, the implied sqrt can be done in Float32, while the final result seemingly needs to be returned as Float64, because of values in [0, 1] (as done in the benchmark code below); actually only values in [0, 1f-46] overflow:

julia> Float32(Float64(1/sqrt(1f-46)))
Inf32
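A minimal sketch of that mixed-precision idea (rsqrt_mixed is a hypothetical name, not a proposed API): do the work in Float32, return Float64, and fall back to full precision for the extreme inputs that blow up in Float32:

    function rsqrt_mixed(x::Float64)
        r = @fastmath Float64(1 / sqrt(Float32(x)))
        isfinite(r) ? r : 1 / sqrt(x)  # full Float64 for inputs that over-/underflow in Float32
    end

    rsqrt_mixed(2.0)    # 0.7071067690849304 (Float32 accuracy, Float64 type)
    rsqrt_mixed(1e-46)  # 1.0e23 via the fallback; the plain Float32 path gives Inf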

My conclusion is that it is not worth bothering with SSE2 unless we make calculations on no fewer than 4 variables. (Maybe this applies only to rsqrt here, but it is an expensive calculation (it also includes multiple multiplications), so it probably applies to other calculations too.)

Also, sqrt(x) is faster than x*rsqrt(x) with two iterations, and x*rsqrt(x) with one iteration is too inaccurate for distance calculations.

There is no function in the standard library that does this, but your compiler might optimize the expression 1 / sqrt(value) such that it does emit the RSQRTSS instruction.

[It doesn’t for Julia, even with fastmath.]

https://robert.ocallahan.org/2021/09/rr-trace-portability-diverging-behavior.html

Unfortunately, today I discovered a new difference between AMD and Intel: the RSQRTSS instruction. Perhaps this is unsurprising, since it is described as: “computes an approximate reciprocal of the square root of the low single-precision floating-point value in the source operand” (emphasis mine). A simple test program:

..
asm ("rsqrtss %1,%0" : "=x"(out) : "x"(in));
..

On Intel Skylake I get

out = 3d7ff000, float = 0.062485

On AMD Rome I get

out = 3d7ff800, float = 0.062492

[Neither value is correct, nor within [prevfloat, nextfloat] of the correctly computed 0.0625, unless 256 is taken to have only 3 or 4 significant decimal digits, i.e. treating it as a range covering up to 256.95f0 is enough.]

Intel's result just stays within the documented 1.5 × 2^-12 relative error bound. (Seems unfortunate given that the exact reciprocal square root of 256 is so easily computed as 0.0625, but whatever...)
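The bound is easy to check from the quoted bit patterns (values copied from the blog post above):

    intel = reinterpret(Float32, 0x3d7ff000)  # ≈ 0.062485 (Intel)
    amd   = reinterpret(Float32, 0x3d7ff800)  # ≈ 0.062492 (AMD)
    relerr(x) = abs(x - 0.0625f0) / 0.0625f0
    relerr(intel) / 2f0^-12  # 1.0: exactly 1.0 × 2^-12, within the 1.5 bound
    relerr(amd)   / 2f0^-12  # 0.5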

https://www.reddit.com/r/programming/comments/12tys7d/revisiting_the_fast_inverse_square_root_is_it/

@foobar_lv, @ChrisRackauckas

    # When called with @fastmath, 1 / sqrt(x::Float32) will use
    # SSE single-precision rsqrt approximation followed by a single
    # iteration of the Newton-Raphson method. This, followed by an
    # additional double-precision Newton-Raphson iteration gives
    # sufficient precision for this problem and is significantly
    # faster than double-precision division and sqrt.
    rd = @ntuple 10 k-> @fastmath Float64(1 / sqrt(Float32(dsq[k])))
julia> @code_lowered rsqrt(10.0f0)
CodeInfo(
1 ─ %1 = Base.FastMath
│   %2 = Base.getproperty(%1, :div_fast)
│   %3 = Base.FastMath
│   %4 = Base.getproperty(%3, :sqrt_fast)
│   %5 = Main.Float32(Main.x)
│   %6 = (%4)(%5)
│   %7 = (%2)(1, %6)
│   %8 = Main.Float64(%7)
└──      return %8
)
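(Aside: the Main.x in the lowered code suggests the rsqrt that was inspected accidentally read a global x instead of its argument.) A self-contained sketch of the quoted pattern, under a hypothetical name: take the @fastmath single-precision estimate, then refine it with one Newton-Raphson step in Float64:

    function rsqrt_refined(x::Float64)
        r = @fastmath Float64(1 / sqrt(Float32(x)))  # ~Float32-accurate estimate
        r * (1.5 - 0.5 * x * r * r)                  # one Newton step: r ← r(3 − x·r²)/2
    end

    rsqrt_refined(2.0)  # ≈ 0.7071067811865475, within a few ulps of 1/sqrt(2.0)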

One use of rsqrt is in Diff Attention for the DIFF Transformer (see my Julia code translated from the paper there):

[Offtopic, but also curious: does Julia ever emit the “FPATAN — Partial Arctangent” instruction? Seemingly not, though I couldn’t confirm; maybe it’s behind the call rax that is all I see in the assembly. Is there a good way to know, and/or to get the fully inlined assembly?
x86 FPATAN instruction · Issue #19330 · ziglang/zig · GitHub]

2 Likes

Already tracked among the issues on GitHub:

4 Likes

That’s likely coming from StableHLO: StableHLO Specification | OpenXLA Project. But it doesn’t look like LLVM has a dedicated intrinsic: LLVM Language Reference Manual — LLVM 21.0.0git documentation

2 Likes

Fun fact: Lux.jl can use that automatically via Reactant.jl (which will emit rsqrt where appropriate).

cc @avik-pal

3 Likes

It’s good to see it on the TODO list there for IEEE 754 recommended functions.

But then a more controversial suggestion: what should the return type be? I suggest Float32 for Float64 input, and for any input really, e.g. for Float16 too, except BigFloat; and I suppose BFloat16 should then return the input type, i.e. BFloat16.

This is the fastest option, since it can access the fastest (single-precision) assembly instruction.

Since it’s a function, we could have rsqrt(x; accurate=false), where overriding the default gives you the same return type back. I don’t know if there’s precedent for the return type to depend on a keyword argument rather than the input type. I don’t think IEEE specifies that Float64 input must return Float64.

There is precedent for NOT just relying on the input type, e.g. 1/3 is a Float64, not an Int or a Rational.
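To make that concrete:

julia> typeof(1 / 3)
Float64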

Should all of these go into Base?

It’s technically breaking if exported. But I’d rather have this in Base, for discoverability and ease of use when accessing a single assembly instruction… and doing Base.rsqrt is awkward, so a proof of concept would be nice, and a PkgEval run ASAP.

It depends on the operation. Of the many terms Clause 5 of IEEE 754 uses, one is formatOf, which “specifies the floating-point destination format, which might be different from the floating-point operands’ formats”. Clause 5 specifies formatOf-squareRoot(source1), and while subclause 9.2 does not use these terms for operations like rSqrt, we can infer the same thing from its definition 1/√(x).

However in practice, people do expect matching input and output numeric types when feasible. This is true of Julia’s sqrt:

julia> typeof.(sqrt.(one.([Float16 Float32 Float64 BigFloat])))
1×4 Matrix{DataType}:
 Float16  Float32  Float64  BigFloat

so rsqrt must match to conform to its IEEE definition.

If this is your intent, we can simply support rsqrt(::Float32) without violating the expectation of matching types, and it’s simple and typical to manually convert scalars or collections to Float32 for performance. Consider the possibility that people need rsqrt to support another format like double precision…

then we’d need a breaking change or a deprecation in favor of a new function.

That’d make the function type-unstable, since keyword arguments (let alone their values rather than their types) don’t participate in dispatch.

That comparison doesn’t make much sense to me. Contrary to other arithmetic operations (apart from un-/signedness and overflow details), (non-integer) division isn’t closed over the integers; unless you want to throw an error for 1/3, the only practical option is to always return a floating-point number for integer input. But here you’re suggesting not respecting the precision of the input, which is a big gotcha.

3 Likes

Furthermore, /(::Int, ::Int) is type stable, so the output type does indeed rely just on the input types.

Sometimes we want the most accurate (or appropriate) result in the same type back, and sometimes we just want access to the fastest method.

Most of the time not all of the 53 bits in a Float64 are significant, and we just use Float64 out of habit. If only slightly fewer than half of them are, then it makes sense to do the calculation/output in Float32. On top of that, RSQRTSS only gives “about 11 significant bits of precision”, and the exact result depends on Intel vs. AMD.

I find it intriguing that rsqrt(n) (with its assembly instruction) is actually the faster way to compute sqrt(n), by then multiplying by n, and the combination is faster; another reason to have rsqrt, and as fast as possible.

Newton-Raphson, take two

Now comes the first key insight: Instead of computing the square root, compute the reciprocal of the square root. […] It turns out that this is a far easier number to compute, and if you need the square root, multiply this number by n and you are done.

I’m going to use Agner Fog’s instruction tables as my source […] So here are the relative costs of some basic floating-point operations. Note that I am only going to deal with non-vector operations for simplicity.

  • […] Double precision floating point add (ADDSD/SUBSD): latency 3, throughput 1
  • Single precision floating point multiply (MULSS): latency 5, throughput 0.5
  • Double precision floating point multiply (MULSD): latency 5, throughput 0.5
  • Single precision floating point divide (DIVSS): latency 10–13, throughput 7
  • Double precision floating point divide (DIVSD): latency 10–20, throughput 8–14
  • […] Single precision floating point square root (SQRTSS): latency 11, throughput 7
  • Double precision floating point square root (SQRTSD): latency 16, throughput 8–14
  • Single precision approximate reciprocal square root (about 11 significant bits of precision, RSQRTSS): latency 3, throughput 1

So sqrt can be calculated in 3 + 5 = 8 cycles of latency (RSQRTSS then MULSS) vs. 11 (SQRTSS, Float32) or 16 (SQRTSD, Float64); yes, the throughput is different, and the accuracy is lower.
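As a sketch of that recipe in Julia (plain operations only, since as noted above Julia does not currently emit RSQRTSS; with a hardware rsqrt this would be the 3 + 5 = 8-cycle latency path):

    # sqrt(n) computed as n * rsqrt(n); refinement iterations would be
    # added if rsqrt were the ~11-bit hardware approximation.
    sqrt_via_rsqrt(n::Float32) = n * (1f0 / sqrt(n))

    sqrt_via_rsqrt(2f0)  # 1.4142135f0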

It felt like we needed only half as many digits in the result, since isqrt always gives half as many binary digits, but it actually seems I was wrong about only half as many significant digits:

The number of significant figures in a square root is equal to the number of significant figures in its square

https://proofwiki.org/wiki/Number_of_Significant_Figures_in_Result_of_Square_Root

https://www.reddit.com/r/learnmath/comments/161qt56/how_do_sig_figs_apply_to_exponents_and_square/

https://brainly.in/question/26701421
math - How many Binary Digits needed for X Decimal Digits of Square Root of 2 - Stack Overflow
3.18: Significant Figures in Multiplication and Division - Chemistry LibreTexts

It wouldn’t be a breaking change for a new (rsqrt) function (I wasn’t proposing, for now, to change others, e.g. sqrt) to not promise more than e.g. 11 significant bits by default. It would be unusual, and maybe it needs to be non-default. I think we’re back to suggesting that: same output type, fully rounded, with the lesser accuracy as a non-default via keyword argument (still returning the same type, most likely; though I’m unconvinced the function needs to be type-stable).

That’s not a breaking change because it’s not a change, just API. However, say some platform comes out with an rsqrt instruction for Float64. People would naturally expect rsqrt(::Float64) to use that instruction and return Float64. That would be a breaking change, and it could have been prevented by not using RSQRTSS for input types other than what its name suggests. Alternatively we could make an rsqrtss or rsqrtFloat32 function just for clarity, though the implicit type conversion still seems unnecessary.

We’d definitely want that; it’s contradictory to introduce type-instability overheads if we’re trying to use a hardware instruction for better performance. Besides, you don’t need a keyword argument for a default value; a trailing positional argument can also serve that purpose AND allows multiple methods.

Possibly, but the reasoning is incorrect. Keyword arguments not participating in dispatch means we can’t specialize the wider function, that is, write multiple methods that vary only in their keyword arguments. The compiler still specializes a method over its keyword arguments’ types. The issue is how that happens: currently, keyworded calls are implemented by the keyword arguments going into a NamedTuple before being passed to an underlying method with only positional arguments.

  • Specializing over accurate=false only gives us one Bool input type for two possible output types: Float32 or the input value’s type. That’s type-unstable, whether accurate is positional or a keyword makes no difference.
  • If we provide a conversion type instead, the NamedTuple and thus the underlying method only sees DataType. ::Type{T} annotations of the keyword argument do not reach the underlying method and can’t introduce type stability as it could for positional arguments.
  • If we provide a value of the target type instead, say its zero, then its type makes it into the NamedTuple and the underlying method can be compiled for it.
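For concreteness, sketches of the first and third bullets (hypothetical names and signatures):

    # Bool keyword: two possible output types for one Bool input type,
    # so inference gives Union{Float64, Float32} (type-unstable).
    rsqrt_kw(x::Float64; accurate::Bool = true) =
        accurate ? 1 / sqrt(x) : Float32(@fastmath 1 / sqrt(Float32(x)))

    # Value of the target type: Float32 (not just a Bool or DataType) reaches
    # the underlying method through the NamedTuple, so the call infers concretely.
    rsqrt_to(x::Float64; to::AbstractFloat = x) = oftype(to, 1 / sqrt(x))

    rsqrt_to(2.0)            # -> Float64
    rsqrt_to(2.0; to = 0f0)  # -> Float32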
3 Likes

The lower the precision, the harder one has to think about numerical error. Most people use 64-bit float because a lot of the time (but of course not always) it gives reasonable precision without explicitly thinking about the numerical analysis of an algorithm too hard. It is much easier to run into issues with 32-bit float.

Regarding RSQRTSS: if it is ever included in Base, it would be best to highlight that it is an approximation, starting with the name; approximate_rsqrt or something like that would be best. Initially, a ::Float32 method would be sufficient, which would require that users convert explicitly. It does not really make sense to have a generic method for something so specific to a single type.
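As illustration only, a sketch of such an explicitly approximate, Float32-only method; since the hardware instruction can’t be shown here, this stands it in with the classic bit trick from the “fast inverse square root” article linked earlier, plus one Newton step:

    function approximate_rsqrt(x::Float32)
        i = 0x5f3759df - (reinterpret(UInt32, x) >> 1)  # bit-level initial guess
        r = reinterpret(Float32, i)
        r * (1.5f0 - 0.5f0 * x * r * r)                 # one Newton-Raphson step
    end

    approximate_rsqrt(256f0)  # ≈ 0.06239, about 0.17% below the exact 0.0625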

12 Likes

I don’t understand who “we” are in this story. rsqrt can be used for many things, like normalizing vectors. When your computation goes on for days and weeks, the use of Float64 is not a mere habit, it is a necessity to keep the errors at a manageable level.

There are applications where it’s sufficient to have an approximate square root. Some RISC architectures let the user (i.e. the compiler) do the iterations of some Newton-like method (or the Babylonian method, or Heron’s method, whatever, it’s all the same), and two or three iterations might be sufficient in some cases. In other applications, the highest possible precision is needed. I’ve seen older Fortran programs in biotech doing REAL*16 computations (emulated) for this purpose.

In case the rsqrt instruction has 32-bit precision, an rsqrt(::Float64) should anyway return a Float64, but properly documented that it might have lower precision. This is important if/when hardware in the future supports 64-bits in its ISA.

2 Likes

I’d just like to clarify that, despite some wrong claims in this thread, sqrt (and rsqrt) are stable operations. Formally: the relative condition number \kappa of the evaluation problem is {\cal O}(1). Thus it is meaningful to return Float64 precision for Float64 input, and indeed using a couple of Newton iterations, as usual for sqrt, should get you to <1 ULP from the truth. Of course, there is also a place for fast, more approximate algorithms. A great place to learn about this and the concept of backward stability is Lectures 12-15 of Trefethen & Bau’s book on Numerical Linear Algebra.
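For reference, the one-line computation behind that claim, using the standard definition of the relative condition number for f(x) = x^{-1/2}:

\kappa(x) = \left| \frac{x\, f'(x)}{f(x)} \right| = \left| \frac{x \cdot (-\tfrac{1}{2})\, x^{-3/2}}{x^{-1/2}} \right| = \frac{1}{2}

so a relative error in the input is roughly halved in the output, for sqrt and rsqrt alike.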

7 Likes