How do I convert the binary form of a float to an integer, and vice versa?

I am trying to implement the “fast inverse square root” algorithm in Julia:

function rsqrt(x::Float32)
    xₛ = x
    int32 = reinterpret(UInt32, xₛ)     # float → raw bits
    int32 = 0x5f3759df - (int32 >> 1)   # magic constant minus half the bits
    xₛ = reinterpret(Float32, int32)    # bits → float: initial guess
    xₛ *= 1.5f0 - x * 0.5f0 * xₛ^2      # one Newton–Raphson step
    return xₛ
end

But the result is definitely not right:

julia> rsqrt(9.0f0)
0.3329532f0

Am I using reinterpret wrongly? If so, what is the correct usage?
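
For what it's worth, a plain round trip through reinterpret looks fine:

julia> bits = reinterpret(UInt32, 9.0f0)
0x41100000

julia> reinterpret(Float32, bits)
9.0f0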

The C code gives the same answer, so maybe this just isn't an exact algorithm(?).

$ cat main.c 
#include <stdio.h>

float Q_rsqrt( float number )
{
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * ( long * ) &y;                       // evil floating point bit level hacking
        i  = 0x5f3759df - ( i >> 1 );               // what the fuck? 
        y  = * ( float * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//      y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

        return y;
}

int main(void)
{
    float x = 9;
    fprintf(stdout, "%f\n", Q_rsqrt(x));
    return 0;
}

$ cat main.jl 
function rsqrt(x::Float32)
    xₛ = x
    int32 = reinterpret(UInt32, xₛ)     # float → raw bits
    int32 = 0x5f3759df - (int32 >> 1)   # magic constant minus half the bits
    xₛ = reinterpret(Float32, int32)    # bits → float: initial guess
    xₛ *= 1.5f0 - x * 0.5f0 * xₛ^2      # one Newton–Raphson step
    return xₛ
end

function main()
    x = 9.0f0
    println(rsqrt(x))
end

if abspath(PROGRAM_FILE) == @__FILE__
    main()
end

$ gcc -o mainc main.c

$ ./mainc
0.332953

$ julia main.jl 
0.3329532

It’s pretty close?

julia> rsqrt(9.0f0)
0.3329532f0

julia> 1/sqrt(9)
0.3333333333333333

And if you do another Newton iteration you get closer.
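
For example, a sketch that tacks a second, identical Newton–Raphson step onto the rsqrt from the question (each step roughly squares the relative error):

function rsqrt2(x::Float32)
    y = rsqrt(x)                    # rsqrt already applies one Newton step
    y *= 1.5f0 - x * 0.5f0 * y^2    # one more Newton–Raphson step
    return y
end

rsqrt2(9.0f0) then agrees with 1/3 to several more digits than 0.3329532f0 does.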


Yeah, it is not an exact algorithm. Sorry, I was somehow comparing it with sqrt. My mistake.


By coincidence I was implementing the same algorithm 5 days ago :smiley:

function fast_sqrt(y::Float32)   # note: approximates 1/√y, despite the name
    x2 = y * 0.5f0
    i = reinterpret(Int32, y)        # float → raw bits
    i = 0x5f3759df - (i >> 1)        # magic constant minus half the bits
    y = reinterpret(Float32, i)      # bits → float: initial guess
    y = y * (1.5f0 - x2 * y * y)     # one Newton–Raphson step
    return y
end

I was going to make it a package. I am surprised that Julia does not have this function built in?

julia> @macroexpand @fastmath 1.0f0 / sqrt(2.0f0)
:(Base.FastMath.div_fast(1.0f0, Base.FastMath.sqrt_fast(2.0f0)))

julia> @macroexpand @fastmath inv(sqrt(2.0f0))
:(Base.FastMath.inv_fast(Base.FastMath.sqrt_fast(2.0f0)))

julia> @code_llvm @fastmath 1.0f0 / sqrt(2.0f0)
;  @ fastmath.jl:154 within `@fastmath'
define nonnull {}* @"julia_@fastmath_537"({ i64, {}* }* nocapture nonnull readonly align 8 dereferenceable(16) %0, {}* nonnull %1, {}* nonnull align 8 dereferenceable(16) %2) {
top:
  %3 = alloca [2 x {}*], align 8
  %gcframe2 = alloca [3 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
  %.sub = getelementptr inbounds [2 x {}*], [2 x {}*]* %3, i64 0, i64 0
  %4 = bitcast [3 x {}*]* %gcframe2 to i8*
  call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %4, i8 0, i32 24, i1 false)
  %5 = call {}*** inttoptr (i64 4314951312 to {}*** ()*)() #4
;  @ fastmath.jl:155 within `@fastmath'
; ┌ @ essentials.jl:480 within `esc'
; │┌ @ boot.jl:253 within `Expr'
    %6 = bitcast [3 x {}*]* %gcframe2 to i64*
    store i64 4, i64* %6, align 16
    %7 = bitcast {}*** %5 to i64*
    %8 = load i64, i64* %7, align 8
    %9 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
    %10 = bitcast {}** %9 to i64*
    store i64 %8, i64* %10, align 8
    %11 = bitcast {}*** %5 to {}***
    store {}** %gcframe2.sub, {}*** %11, align 8
    store {}* inttoptr (i64 4310591112 to {}*), {}** %.sub, align 8
    %12 = getelementptr inbounds [2 x {}*], [2 x {}*]* %3, i64 0, i64 1
    store {}* %2, {}** %12, align 8
    %13 = call nonnull {}* @jl_f__expr({}* null, {}** nonnull %.sub, i32 2)
    %14 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
    store {}* %13, {}** %14, align 16
; └└
  store {}* %13, {}** %.sub, align 8
  %15 = call nonnull {}* @j1_make_fastmath_539({}* inttoptr (i64 4718928432 to {}*), {}** nonnull %.sub, i32 1)
  %16 = load i64, i64* %10, align 8
  store i64 %16, i64* %7, align 8
  ret {}* %15
}

I mean, it’s only an approximation, and if you compare performance:

using BenchmarkTools

function fast_sqrt(y::Float32)
    x2 = y * 0.5f0
    i = reinterpret(Int32, y)
    i = 0x5f3759df - (i >> 1)
    y = reinterpret(Float32, i)
    y = y * (1.5f0 - x2 * y^2)
    return y
end


function compare_inv_sqrt(n=100000)
    arr = abs.(10000 .* randn(Float32, n)) 
    @btime fast_sqrt.($arr)
    @btime 1 ./ sqrt.($arr)
    @btime @fastmath 1 ./ sqrt.($arr)
    return nothing
end

## REPL
julia> compare_inv_sqrt()
  23.092 μs (2 allocations: 390.70 KiB)
  154.720 μs (2 allocations: 390.70 KiB)
  36.811 μs (2 allocations: 390.70 KiB)

Performance is only ~40% better than the @fastmath version (though ~6.7× better than the plain 1 ./ sqrt broadcast).

I think the speed difference was much more drastic on older hardware.


Also, x86 has a dedicated instruction for this that is more accurate (14 bits) and faster: RSQRTSS (“Compute Reciprocal of Square Root of Scalar Single-Precision Floating-Point Value”).
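
If you want to poke at it from Julia, something like this llvmcall sketch might work on x86_64; treat it as a hypothetical, untested experiment (the intrinsic name llvm.x86.sse.rsqrt.ss is LLVM's, and rsqrt_hw is just an illustrative name):

# untested sketch: wrap the scalar SSE rsqrt estimate in a Julia function
rsqrt_hw(x::Float32) = Base.llvmcall(
    ("declare <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float>)",
     """
     %v = insertelement <4 x float> undef, float %0, i32 0
     %r = call <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float> %v)
     %s = extractelement <4 x float> %r, i32 0
     ret float %s
     """),
    Float32, Tuple{Float32}, x)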


Would maybe be reasonable to expose this. The only trouble is what to do on non-x86 hardware. You’d want a software implementation of that instruction that’s as accurate but not too slow.

Yeah, but it would be really interesting to have this smart trick somewhere. And we could have it for Float64 (that’s easy), for 128-bit floats (as implemented by MultiFloats.jl), etc. It would not be easy for me to do this by myself.

The reason I want Julia to support this officially is that I hope there can be a way for it to work on 128-bit floats and beyond. I do not know the internals of Julia, so I can only implement it for Float32, Float64, etc., one type at a time.

This is a really bad approach for higher precision than Float32. If you only care about a rough approximation, you shouldn’t be using MultiFloats, and if you care about precision, there are better algorithms to compute it accurately.

I guess there are versions for 64-bit numbers, since someone found the magic number (0x5FE6EB50C7B537A9). But you are probably right: we do not know the magic number for 128 bits and beyond.
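
For reference, a minimal sketch of the Float64 variant built around that constant, mirroring the Float32 version from the question (one Newton step; untested beyond the structure):

function rsqrt(x::Float64)
    i = reinterpret(UInt64, x)
    i = 0x5fe6eb50c7b537a9 - (i >> 1)   # the 64-bit “magic number”
    y = reinterpret(Float64, i)
    y *= 1.5 - x * 0.5 * y^2            # one Newton–Raphson step
    return y
end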

There’s nothing particularly magic about it, it’s just a solution to an optimization problem. For Float32 you can do an exhaustive search, for more bits you can get a good enough solution with a coarse search and interval halving refinement.
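
As a hedged sketch of that idea (coarse grid only, without the interval-halving refinement; the grid bounds here are illustrative): score each candidate constant by the worst relative error of the raw bit hack over sample inputs, then keep the best.

function worst_relerr(magic::UInt32, xs)
    worst = 0.0
    for x in xs
        y = reinterpret(Float32, magic - (reinterpret(UInt32, x) >> 1))
        worst = max(worst, abs(y * sqrt(Float64(x)) - 1))  # |y·√x − 1| is the relative error
    end
    return worst
end

xs = Float32.(exp2.(range(0, 2, length=4096)))   # x ∈ [1, 4): the error pattern repeats every two octaves
candidates = 0x5f370000:0x00000100:0x5f380000    # coarse grid around the known constant
best = argmin(c -> worst_relerr(c, xs), candidates)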

But Oscar is right, there’s little point in doing this for MultiFloats. And if you insist that you need a really coarse approximation for your high precision number, just cast your input to Float32, do your thing, cast it back to MultiFloat and call it a day.
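
In code, that last suggestion might look like this one-liner (a sketch assuming the high-precision type converts to and from Float64, as MultiFloats types do, and reusing the Float32 rsqrt from earlier in the thread):

# coarse seed: compute the approximation in Float32, cast back to the original type
coarse_rsqrt(x) = oftype(x, rsqrt(Float32(Float64(x))))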


Yeah, I know how it’s obtained. I just followed Wikipedia in calling it a “magic number”.

Yes, the rsqrt x86 instruction is much faster than this.
AVX512 also provides a slightly different, slightly more accurate version, as well as one for Float64, both with 2^-14 maximum error. This is just enough to require one fewer Newton iteration for certain accuracy targets.

So there’s no reason to use the Quake algorithm on any recent x86 Intel or AMD CPU.

julia> x = rand(Float32, 1024); y = similar(x);

julia> function rsqrt(x::Float32)
           xₛ = x
           int32 = reinterpret(UInt32, xₛ)
           int32 = 0x5f3759df - int32 >> 1
           xₛ = reinterpret(Float32, int32)
           xₛ *= 1.5f0 - x * 0.5f0 * xₛ^2
           xₛ *= 1.5f0 - x * 0.5f0 * xₛ^2
           return xₛ
       end
rsqrt (generic function with 1 method)

julia> function invsqrt_approx!(y,x)
           @inbounds for i ∈ eachindex(y,x)
               y[i] = rsqrt(x[i])
           end
       end
invsqrt_approx! (generic function with 1 method)

julia> function invsqrt!(y,x)
           @fastmath @inbounds for i ∈ eachindex(y,x)
               y[i] = 1/sqrt(x[i])
           end
       end
invsqrt! (generic function with 1 method)

julia> @benchmark invsqrt_approx!($y, $x)
BenchmarkTools.Trial: 10000 samples with 886 evaluations.
 Range (min … max):  129.239 ns … 190.010 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     129.896 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   129.920 ns ±   0.743 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▃▂▆█▇▁
  ▂▂▃▄▄▃▃▃▃▂▂▃███████▃▂▁▁▁▁▂▁▁▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  129 ns           Histogram: frequency by time          132 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> sum(abs, y .- 1 ./ sqrt.(Float64.(x)))
0.003846986966078658

julia> @benchmark invsqrt!($y, $x)
BenchmarkTools.Trial: 10000 samples with 956 evaluations.
 Range (min … max):  89.877 ns … 143.095 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     89.974 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   90.048 ns ±   0.668 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█▆▂   ▁▄▄                                                 ▂
  ▆█████▁▁▃████▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▆▅▅▇▇▇▇▇▇▇▆▇ █
  89.9 ns       Histogram: log(frequency) by time      91.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> sum(abs, y .- 1 ./ sqrt.(Float64.(x)))
6.114518983602046e-5

The @fastmath version + 1 Newton iteration is much faster and more accurate than this approximation + 2 Newton iterations.
The compiler does not use the Quake algorithm because it is not a good idea to do so.

Note that I added a second Newton iteration to improve accuracy (which makes it slower, and still less accurate).

I’d probably warn people about the intrinsic because it is really quite inaccurate.
There is a similar – very fast but inaccurate – instruction for approximate reciprocals.
These can be useful for implementing certain math functions, e.g. for reduction steps where you’re concerned more about splitting a number into pieces than the precise allocation of said pieces.
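
For the curious, LLVM exposes that reciprocal estimate as llvm.x86.sse.rcp.ss; an equally hypothetical, untested wrapper in the same style as the rsqrt llvmcall sketch above:

rcp_hw(x::Float32) = Base.llvmcall(
    ("declare <4 x float> @llvm.x86.sse.rcp.ss(<4 x float>)",
     """
     %v = insertelement <4 x float> undef, float %0, i32 0
     %r = call <4 x float> @llvm.x86.sse.rcp.ss(<4 x float> %v)
     %s = extractelement <4 x float> %r, i32 0
     ret float %s
     """),
    Float32, Tuple{Float32}, x)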
