How do I convert the binary form of a float to an integer, and vice versa?

I am trying to implement the “fast inverse square root” algorithm in Julia:

function rsqrt(x::Float32)
    xₛ = x
    int32 = reinterpret(UInt32, xₛ)     # float → raw bits
    int32 = 0x5f3759df - (int32 >> 1)   # magic constant minus half the bits
    xₛ = reinterpret(Float32, int32)    # bits → float: initial guess
    xₛ *= 1.5f0 - x * 0.5f0 * xₛ^2      # one Newton–Raphson step
    return xₛ
end

But the result is definitely not right:

julia> rsqrt(9.0f0)
0.3329532f0

Am I using reinterpret wrongly? If so, what is the correct usage?
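
For what it's worth, a plain round trip through reinterpret looks fine:

julia> bits = reinterpret(UInt32, 9.0f0)
0x41100000

julia> reinterpret(Float32, bits)
9.0f0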

The C code gives the same answer, so maybe this just isn't an exact algorithm(?).

$ cat main.c 
#include <stdio.h>

float Q_rsqrt( float number )
{
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * ( long * ) &y;                       // evil floating point bit level hacking
        i  = 0x5f3759df - ( i >> 1 );               // what the fuck? 
        y  = * ( float * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//      y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

        return y;
}

int main(void)
{
    float x = 9;
    fprintf(stdout, "%f\n", Q_rsqrt(x));
    return 0;
}

$ cat main.jl 
function rsqrt(x::Float32)
    xₛ = x
    int32 = reinterpret(UInt32, xₛ)     # float → raw bits
    int32 = 0x5f3759df - (int32 >> 1)   # magic constant minus half the bits
    xₛ = reinterpret(Float32, int32)    # bits → float: initial guess
    xₛ *= 1.5f0 - x * 0.5f0 * xₛ^2      # one Newton–Raphson step
    return xₛ
end

function main()
    x = 9.0f0
    println(rsqrt(x))
end

if abspath(PROGRAM_FILE) == @__FILE__
    main()
end

$ gcc -o mainc main.c

$ ./mainc
0.332953

$ julia main.jl 
0.3329532

It’s pretty close?

julia> rsqrt(9.0f0)
0.3329532f0

julia> 1/sqrt(9)
0.3333333333333333

And if you do another Newton iteration you get closer.
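
For example, a sketch that tacks a second, identical Newton–Raphson step onto the rsqrt from the question (each step roughly squares the relative error):

function rsqrt2(x::Float32)
    y = rsqrt(x)                    # rsqrt already applies one Newton step
    y *= 1.5f0 - x * 0.5f0 * y^2    # one more Newton–Raphson step
    return y
end

rsqrt2(9.0f0) then agrees with 1/3 to several more digits than 0.3329532f0 does.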


Yeah, it is not an exact algorithm. Sorry, I was somehow comparing it with sqrt. My mistake.


By coincidence I was implementing the same algorithm 5 days ago :smiley:

function fast_sqrt(y::Float32)   # note: approximates 1/√y, despite the name
    x2 = y * 0.5f0
    i = reinterpret(Int32, y)        # float → raw bits
    i = 0x5f3759df - (i >> 1)        # magic constant minus half the bits
    y = reinterpret(Float32, i)      # bits → float: initial guess
    y = y * (1.5f0 - x2 * y * y)     # one Newton–Raphson step
    return y
end

I was going to make it a package. I am surprised that Julia does not have this function built in?

julia> @macroexpand @fastmath 1.0f0 / sqrt(2.0f0)
:(Base.FastMath.div_fast(1.0f0, Base.FastMath.sqrt_fast(2.0f0)))

julia> @macroexpand @fastmath inv(sqrt(2.0f0))
:(Base.FastMath.inv_fast(Base.FastMath.sqrt_fast(2.0f0)))

julia> @code_llvm @fastmath 1.0f0 / sqrt(2.0f0)
;  @ fastmath.jl:154 within `@fastmath'
define nonnull {}* @"julia_@fastmath_537"({ i64, {}* }* nocapture nonnull readonly align 8 dereferenceable(16) %0, {}* nonnull %1, {}* nonnull align 8 dereferenceable(16) %2) {
top:
  %3 = alloca [2 x {}*], align 8
  %gcframe2 = alloca [3 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
  %.sub = getelementptr inbounds [2 x {}*], [2 x {}*]* %3, i64 0, i64 0
  %4 = bitcast [3 x {}*]* %gcframe2 to i8*
  call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %4, i8 0, i32 24, i1 false)
  %5 = call {}*** inttoptr (i64 4314951312 to {}*** ()*)() #4
;  @ fastmath.jl:155 within `@fastmath'
; ┌ @ essentials.jl:480 within `esc'
; │┌ @ boot.jl:253 within `Expr'
    %6 = bitcast [3 x {}*]* %gcframe2 to i64*
    store i64 4, i64* %6, align 16
    %7 = bitcast {}*** %5 to i64*
    %8 = load i64, i64* %7, align 8
    %9 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
    %10 = bitcast {}** %9 to i64*
    store i64 %8, i64* %10, align 8
    %11 = bitcast {}*** %5 to {}***
    store {}** %gcframe2.sub, {}*** %11, align 8
    store {}* inttoptr (i64 4310591112 to {}*), {}** %.sub, align 8
    %12 = getelementptr inbounds [2 x {}*], [2 x {}*]* %3, i64 0, i64 1
    store {}* %2, {}** %12, align 8
    %13 = call nonnull {}* @jl_f__expr({}* null, {}** nonnull %.sub, i32 2)
    %14 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
    store {}* %13, {}** %14, align 16
; └└
  store {}* %13, {}** %.sub, align 8
  %15 = call nonnull {}* @j1_make_fastmath_539({}* inttoptr (i64 4718928432 to {}*), {}** nonnull %.sub, i32 1)
  %16 = load i64, i64* %10, align 8
  store i64 %16, i64* %7, align 8
  ret {}* %15
}

I mean, it’s only an approximation, and if you compare performance:

using BenchmarkTools

function fast_sqrt(y::Float32)
    x2 = y * 0.5f0
    i = reinterpret(Int32, y)
    i = 0x5f3759df - (i >> 1)
    y = reinterpret(Float32, i)
    y = y * (1.5f0 - x2 * y^2)
    return y
end


function compare_inv_sqrt(n=100000)
    arr = abs.(10000 .* randn(Float32, n)) 
    @btime fast_sqrt.($arr)
    @btime 1 ./ sqrt.($arr)
    @btime @fastmath 1 ./ sqrt.($arr)
    return nothing
end

## REPL
julia> compare_inv_sqrt()
  23.092 μs (2 allocations: 390.70 KiB)
  154.720 μs (2 allocations: 390.70 KiB)
  36.811 μs (2 allocations: 390.70 KiB)

Performance is only ~40% better than the @fastmath version (though ~6.7× better than the plain 1 ./ sqrt broadcast).

I think the speed difference was much more drastic on older hardware.


Also, x86 has a dedicated instruction for this that is more accurate (14 bits) and faster: RSQRTSS (“Compute Reciprocal of Square Root of Scalar Single-Precision Floating-Point Value”).
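
If you want to poke at it from Julia, something like this llvmcall sketch might work on x86_64; treat it as a hypothetical, untested experiment (the intrinsic name llvm.x86.sse.rsqrt.ss is LLVM's, and rsqrt_hw is just an illustrative name):

# untested sketch: wrap the scalar SSE rsqrt estimate in a Julia function
rsqrt_hw(x::Float32) = Base.llvmcall(
    ("declare <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float>)",
     """
     %v = insertelement <4 x float> undef, float %0, i32 0
     %r = call <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float> %v)
     %s = extractelement <4 x float> %r, i32 0
     ret float %s
     """),
    Float32, Tuple{Float32}, x)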


Would maybe be reasonable to expose this. The only trouble is what to do on non-x86 hardware. You’d want a software implementation of that instruction that’s as accurate but not too slow.

Yeah, but it would be really interesting to have this smart trick somewhere. And we could have it for Float64 (that’s easy), for 128-bit floats (as implemented by MultiFloats.jl), etc. It would not be easy for me to do this by myself.

The reason I want Julia to support this officially is that I hope there can be a way for it to work on 128-bit floats and beyond. I do not know the internals of Julia, so I can only implement it for Float32, Float64, etc., one type at a time.

This is a really bad approach for higher precision than Float32. If you only care about a rough approximation, you shouldn’t be using MultiFloats, and if you care about precision, there are better algorithms to compute it accurately.

I guess there are versions for 64-bit numbers, since someone found the magic number (0x5FE6EB50C7B537A9). But you are probably right: we do not know the magic number for 128 bits and beyond.
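
For reference, a minimal sketch of the Float64 variant built around that constant, mirroring the Float32 version from the question (one Newton step; untested beyond the structure):

function rsqrt(x::Float64)
    i = reinterpret(UInt64, x)
    i = 0x5fe6eb50c7b537a9 - (i >> 1)   # the 64-bit “magic number”
    y = reinterpret(Float64, i)
    y *= 1.5 - x * 0.5 * y^2            # one Newton–Raphson step
    return y
end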

There’s nothing particularly magic about it, it’s just a solution to an optimization problem. For Float32 you can do an exhaustive search, for more bits you can get a good enough solution with a coarse search and interval halving refinement.
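
As a hedged sketch of that idea (coarse grid only, without the interval-halving refinement; the grid bounds here are illustrative): score each candidate constant by the worst relative error of the raw bit hack over sample inputs, then keep the best.

function worst_relerr(magic::UInt32, xs)
    worst = 0.0
    for x in xs
        y = reinterpret(Float32, magic - (reinterpret(UInt32, x) >> 1))
        worst = max(worst, abs(y * sqrt(Float64(x)) - 1))  # |y·√x − 1| is the relative error
    end
    return worst
end

xs = Float32.(exp2.(range(0, 2, length=4096)))   # x ∈ [1, 4): the error pattern repeats every two octaves
candidates = 0x5f370000:0x00000100:0x5f380000    # coarse grid around the known constant
best = argmin(c -> worst_relerr(c, xs), candidates)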

But Oscar is right, there’s little point in doing this for MultiFloats. And if you insist that you need a really coarse approximation for your high precision number, just cast your input to Float32, do your thing, cast it back to MultiFloat and call it a day.
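
In code, that last suggestion might look like this one-liner (a sketch assuming the high-precision type converts to and from Float64, as MultiFloats types do, and reusing the Float32 rsqrt from earlier in the thread):

# coarse seed: compute the approximation in Float32, cast back to the original type
coarse_rsqrt(x) = oftype(x, rsqrt(Float32(Float64(x))))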


Yeah, I know how it’s obtained. I just followed Wikipedia in calling it a “magic number”.

Yes, the rsqrt x86 instruction is much faster than this.
AVX512 also provides a slightly different, slightly more accurate version, as well as one for Float64, both with 2^-14 maximum error. This is just enough to require one fewer Newton iteration for certain accuracy targets.

So there’s no reason to use the Quake algorithm on any recent x86 Intel or AMD CPU.

julia> x = rand(Float32, 1024); y = similar(x);

julia> function rsqrt(x::Float32)
           xₛ = x
           int32 = reinterpret(UInt32, xₛ)
           int32 = 0x5f3759df - int32 >> 1
           xₛ = reinterpret(Float32, int32)
           xₛ *= 1.5f0 - x * 0.5f0 * xₛ^2
           xₛ *= 1.5f0 - x * 0.5f0 * xₛ^2
           return xₛ
       end
rsqrt (generic function with 1 method)

julia> function invsqrt_approx!(y,x)
           @inbounds for i ∈ eachindex(y,x)
               y[i] = rsqrt(x[i])
           end
       end
invsqrt_approx! (generic function with 1 method)

julia> function invsqrt!(y,x)
           @fastmath @inbounds for i ∈ eachindex(y,x)
               y[i] = 1/sqrt(x[i])
           end
       end
invsqrt! (generic function with 1 method)

julia> @benchmark invsqrt_approx!($y, $x)
BenchmarkTools.Trial: 10000 samples with 886 evaluations.
 Range (min … max):  129.239 ns … 190.010 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     129.896 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   129.920 ns ±   0.743 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▃▂▆█▇▁
  ▂▂▃▄▄▃▃▃▃▂▂▃███████▃▂▁▁▁▁▂▁▁▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  129 ns           Histogram: frequency by time          132 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> sum(abs, y .- 1 ./ sqrt.(Float64.(x)))
0.003846986966078658

julia> @benchmark invsqrt!($y, $x)
BenchmarkTools.Trial: 10000 samples with 956 evaluations.
 Range (min … max):  89.877 ns … 143.095 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     89.974 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   90.048 ns ±   0.668 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█▆▂   ▁▄▄                                                 ▂
  ▆█████▁▁▃████▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▆▅▅▇▇▇▇▇▇▇▆▇ █
  89.9 ns       Histogram: log(frequency) by time      91.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> sum(abs, y .- 1 ./ sqrt.(Float64.(x)))
6.114518983602046e-5

The @fastmath version + 1 Newton iteration is much faster and more accurate than this approximation + 2 Newton iterations.
The compiler does not use the Quake algorithm because it is not a good idea to do so.

Note that I added a second Newton iteration to improve accuracy (which makes it slower, and still less accurate).

I’d probably warn people about the intrinsic because it is really quite inaccurate.
There is a similar – very fast but inaccurate – instruction for approximate reciprocals.
These can be useful for implementing certain math functions, e.g. for reduction steps where you’re concerned more about splitting a number into pieces than the precise allocation of said pieces.
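
For the curious, LLVM exposes that reciprocal estimate as llvm.x86.sse.rcp.ss; an equally hypothetical, untested wrapper in the same style as the rsqrt llvmcall sketch above:

rcp_hw(x::Float32) = Base.llvmcall(
    ("declare <4 x float> @llvm.x86.sse.rcp.ss(<4 x float>)",
     """
     %v = insertelement <4 x float> undef, float %0, i32 0
     %r = call <4 x float> @llvm.x86.sse.rcp.ss(<4 x float> %v)
     %s = extractelement <4 x float> %r, i32 0
     ret float %s
     """),
    Float32, Tuple{Float32}, x)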
