Newton's method within a factor of 1.5 of inv(x)

I was playing around with Newton’s method the other day and was mind-blown when I realised that a fairly straightforward Julia implementation of Newton’s method applied to f(x) = 1/x - y comes within roughly a factor of 1.5 of inv(y)! :exploding_head: :exploding_head: :exploding_head:


# Compute `inv(y)` for `1 <= y <= 2` using Newton's method.
function newton_inv(y)
    @fastmath x = 24/17 - (8/17) * y
    for k = 1:4
        @fastmath x = x * (2 - y*x)
    end
    return x
end
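For completeness, here is a quick self-contained check of the claim (the definition is repeated so the snippet runs on its own; the tolerance of two ulps is my own conservative bound):

```julia
# Sanity check (sketch): newton_inv should agree with inv(y) to within
# a couple of ulps across [1, 2].
function newton_inv(y)
    @fastmath x = 24/17 - (8/17) * y
    for k = 1:4
        @fastmath x = x * (2 - y*x)
    end
    return x
end

for y in (1.0, 1.25, 1.5, 1.75, 2.0)
    @assert abs(newton_inv(y) - inv(y)) <= 2 * eps(inv(y))
end
```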

using BenchmarkTools
using Printf

function benchmark()
    @printf("       inv runtime: %6.2f nanoseconds\n", @belapsed(       inv($(Ref(1.0))[]))*1e9)
    @printf("    Newton runtime: %6.2f nanoseconds\n", @belapsed(newton_inv($(Ref(1.0))[]))*1e9)
end

julia> BenchmarkTools.DEFAULT_PARAMETERS.samples = 1_000_000;

julia> benchmark()   # Laptop plugged in
      inv runtime:   1.52 nanoseconds
   Newton runtime:   2.13 nanoseconds

julia> benchmark()   # Laptop on battery
      inv runtime:   2.53 nanoseconds
   Newton runtime:   3.37 nanoseconds

Furthermore, newton_inv() appears to produce almost correctly rounded answers:

function accuracy()
    y = 1.0 .+ rand(100_000)
    x = newton_inv.(y)
    println("    Errors: ", unique(abs.(x .- inv.(y))))
end

julia> accuracy()
    Errors: [0.0, 1.1102230246251565e-16]

Some details regarding the algorithm are described on the Wikipedia page for division algorithms.
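In short, the iteration is what Newton's method gives for f(x) = 1/x - y: x ← x - f(x)/f'(x) = x·(2 - yx), which contains no division. A small sketch (the helper name is mine, not from the code above) showing how fast the error shrinks from the linear initial estimate:

```julia
# Sketch: Newton's method on f(x) = 1/x - y updates
#   x ← x - f(x)/f'(x) = x + x^2*(1/x - y) = x*(2 - y*x),
# i.e. it is division-free. Starting from the linear estimate
# (max relative error 1/17 on [1, 2]), the error roughly squares
# with every step.
function newton_errors(y)
    x = 24/17 - (8/17) * y    # initial estimate
    errs = Float64[]
    for k in 1:4
        x = x * (2 - y * x)
        push!(errs, abs(x - 1/y))
    end
    return errs
end

newton_errors(1.7)   # errors shrink quadratically toward rounding level
```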

Your newton_inv itself uses inv (the `/` in `24/17` and `8/17`).


That /17 is easily converted to a multiplication.
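Right: the divisions are between literals, so they should constant-fold and no runtime division remains. A sketch making that explicit (the helper name is mine):

```julia
# Sketch (hypothetical helper): 24/17 and 8/17 fold to constants, and
# muladd writes the initial estimate as a single multiply-add with no
# runtime division. The fused form may differ from 24/17 - (8/17)*y by
# at most one rounding.
initial_estimate(y) = muladd(-8/17, y, 24/17)
```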

It’s nice to see; sometimes people actually do improve this kind of stuff in Julia, I mean exponentials, roots…


As far as I know this is how a lot of inverse functions are computed, e.g. look at base/special/cbrt.jl. For other special functions, it looks more like argument reduction + a polynomial fit.


Note that when you’re testing the error, you really should use BigFloats. Otherwise, you could see errors that are noticeably higher if the base version makes different rounding decisions.


The first line of newton_inv(y) indeed just evaluates a linear polynomial whose coefficients are computed at compile time. I have updated my code to make this clearer.

I have also increased the number of benchmarking samples to make the runtime estimate more accurate. Now I get exactly the same runtimes every time I run benchmark(). Further, I have noticed that the runtime changes depending on whether my laptop is plugged in or on battery power. This is probably not surprising, but it is still fascinating to see this effect so clearly.

inv(y) should be governed by IEEE-754 and return correctly rounded results; hence comparing newton_inv(y) against inv(y) shows to what extent newton_inv(y) implements correct rounding.
Here’s the error if I do compute the reference result using BigFloat:

function accuracy()
    y = 1.0 .+ rand(100_000)
    x = newton_inv.(y)
    println("Errors: ", maximum(abs.(x .- inv.(big.(y)))))
end

julia> accuracy()
Errors: 1.62...e-16

A few more things which I find quite mind-blowing about this example:

  • newton_inv(y) competes against what is essentially a single machine instruction:
julia> @code_native inv(1.5)
; ┌ @ number.jl:199 within `inv'
        vmovsd  7369968(%rip), %xmm1    # xmm1 = mem[0],zero
; │┌ @ float.jl:407 within `/'
        vdivsd  %xmm0, %xmm1, %xmm0
; │└
        nopl    (%rax)
; └

It seems pretty insane that the simple code in newton_inv() can almost compete with what is hardwired into my CPU. For comparison, the cbrt code mentioned by @antoine-levitt first does some bit-fiddling magic, then evaluates some complicated polynomial and finally does two Newton steps to achieve full Float64 accuracy.

  • newton_inv(y) performs 9 multiplications and 5 additions, all of which have to be executed in serial order, in just 2.13 nanoseconds. My CPU has a max turbo frequency of 4GHz, so if we assume one instruction per cycle and that the additions are merged with some of the multiplications as fused-multiply-adds, this gives us an estimated runtime of 9/4e9 = 2.25 nanoseconds. I’m not 100% sure whether this analysis is indeed sane, but it gives me a warm fuzzy feeling that this number aligns so closely with the empirically observed runtime of 2.13 nanoseconds :blush:
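If it helps make the count concrete, here is my own rewrite with explicit muladd (not the original code): the initial estimate is a single fma and each Newton step is one multiply plus one fma, i.e. 9 mul/fma instructions on a single serial dependency chain.

```julia
# Sketch (my rewrite, assuming the compiler fuses the adds into fmas):
# 1 fma for the initial estimate plus (1 mul + 1 fma) per step,
# so 9 mul/fma instructions total, all serially dependent.
function newton_inv_fma(y)
    x = muladd(-8/17, y, 24/17)        # 1 fma
    for k = 1:4
        x = x * muladd(-y, x, 2.0)     # 1 mul + 1 fma per step
    end
    return x
end
```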

It’s probably more complicated than that. While those operations should all have 1/cycle throughput, they will often have 4 cycles or so of latency. This is complicated by the fact that when you are benchmarking, Julia is probably interleaving multiple invocations (i.e. it will do the first step for 4 of them, then the next step, etc.).