# Newton method within factor 1.5 of inv(x)

I was playing around with Newton’s method the other day and was mind-blown when I realised that a fairly straightforward Julia implementation of Newton’s method applied to `f(x) = 1/x - y` comes within roughly a factor of 1.5 of the runtime of the built-in `inv(y)`!

``````"""
newton_inv(y)

Compute `inv(y)` for `1 <= y <= 2` using Newton's method.
"""
function newton_inv(y)
    @fastmath x = 24/17 - (8/17) * y
    for k = 1:4
        @fastmath x = x * (2 - y*x)
    end
    return x
end

using BenchmarkTools
using Printf

function benchmark()
    @printf("       inv runtime: %6.2f nanoseconds\n", @belapsed(       inv($(Ref(1.0))[]))*1e9)
    @printf("    Newton runtime: %6.2f nanoseconds\n", @belapsed(newton_inv($(Ref(1.0))[]))*1e9)
end

julia> BenchmarkTools.DEFAULT_PARAMETERS.samples = 1_000_000;

julia> benchmark()   # Laptop plugged in
       inv runtime:   1.52 nanoseconds
    Newton runtime:   2.13 nanoseconds

julia> benchmark()   # Laptop on battery
       inv runtime:   2.53 nanoseconds
    Newton runtime:   3.37 nanoseconds

``````

Furthermore, `newton_inv()` appears to produce almost correctly rounded answers:

``````function accuracy()
    y = 1.0 .+ rand(100_000)
    x = newton_inv.(y)
    println("    Errors: ", unique(abs.(x .- inv.(y))))
end

julia> accuracy()
    Errors: [0.0, 1.1102230246251565e-16]
``````

Some details regarding the algorithm are described on the Wikipedia page for division algorithms.
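For reference, the quadratic convergence that makes four iterations suffice can be checked directly. Here is a sketch (with a hypothetical `check_convergence` helper; the bound `1/17` on the initial error is the standard one for this choice of starting polynomial):

``````julia
# Sketch: the relative error e = 1 - y*x is squared by each Newton step,
# since 1 - y*x*(2 - y*x) = (1 - y*x)^2. The initial guess 24/17 - (8/17)*y
# keeps |e| <= 1/17 on [1, 2], so four steps push the error down to
# roughly (1/17)^16 ≈ 2e-20, far below Float64 machine epsilon.
function check_convergence(y)
    x = 24/17 - (8/17) * y
    for _ = 1:4
        e = abs(1 - y * x)
        x = x * (2 - y * x)
        @assert abs(1 - y * x) <= e^2 + 1e-15   # squared error, up to rounding
    end
    return x
end

@assert abs(check_convergence(1.7) - inv(1.7)) <= 1e-15
``````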

Your `newton_inv` itself uses `inv` (the `/` in `24/17` and `8/17`).

1 Like

That `/17` is easily converted to a multiplication.
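A sketch of that variant, with a hypothetical `initial_guess` helper (Julia already constant-folds the literal `24/17` and `8/17`, so this mainly makes the intent explicit):

``````julia
# Hypothetical variant: fold the divisions into literal constants and
# fuse the initial guess into a single muladd, i.e. 24/17 - (8/17)*y.
initial_guess(y) = muladd(y, -(8 / 17), 24 / 17)

initial_guess(1.5)   # close to 12/17 ≈ 0.70588
``````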

It’s nice to see. Sometimes I actually see people improve this kind of stuff in Julia, I mean exponentials, roots…

2 Likes

As far as I know this is how a lot of inverse functions are computed, e.g. look at `base/special/cbrt.jl`. For other special functions, it looks more like argument reduction + polynomial fit.

3 Likes

Note that when you’re testing the error, you really should use `BigFloat`s. Otherwise, you could see errors that are noticeably higher if the base version makes different rounding decisions.

3 Likes

The first line of `newton_inv(y)` indeed just evaluates a linear polynomial whose coefficients are computed at compile time. I have updated my code to make this clearer.

I have also increased the number of benchmarking samples to make the runtime estimate more accurate. Now I am getting exactly the same runtimes every time I run `benchmark()`. Further, I have noticed that the runtime changes depending on whether my laptop is plugged in or running on battery power. This is probably not surprising, but it is still fascinating to see the effect so clearly.

`inv(y)` should be governed by IEEE 754 and return correctly rounded results; hence comparing `newton_inv(y)` against `inv(y)` shows to what extent `newton_inv(y)` implements correct rounding.
Here’s the error if I instead compute the reference result using `BigFloat`:

``````function accuracy()
    y = 1.0 .+ rand(100_000)
    x = newton_inv.(y)
    println("Errors: ", maximum(abs.(x .- inv.(big.(y)))))
end

julia> accuracy()
Errors: 1.62...e-16
``````
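As a rough back-of-the-envelope (taking the reported maximum error as roughly `1.6e-16`): for `y` in `[1, 2]` the result `inv(y)` lies in `[0.5, 1]`, where one unit in the last place is `eps(0.5)`, so the worst case is about 1.5 ulps, compared to the 0.5 ulps of a correctly rounded `inv`:

``````julia
# Sketch: convert the reported maximum absolute error into ulps.
# Results inv(y) for y in [1, 2] lie in [0.5, 1], where the spacing of
# Float64 values is eps(0.5) ≈ 1.11e-16.
ulps = 1.6e-16 / eps(0.5)   # between 1.4 and 1.5 ulps
``````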

• `newton_inv(y)` competes against what is essentially a single machine instruction:
``````julia> @code_native inv(1.5)
.text
; ┌ @ number.jl:199 within `inv'
vmovsd  7369968(%rip), %xmm1    # xmm1 = mem[0],zero
; │┌ @ float.jl:407 within `/'
vdivsd  %xmm0, %xmm1, %xmm0
; │└
retq
nopl    (%rax)
; └
``````

It seems pretty insane that the simple code in `newton_inv()` can almost compete with what is hardwired into my CPU. For comparison, the `cbrt` code mentioned by @antoine-levitt first does some bit-fiddling magic, then evaluates some complicated polynomial and finally does two Newton steps to achieve full `Float64` accuracy.

• `newton_inv(y)` performs 9 multiplications and 5 additions, all of which have to be executed in serial order, in just 2.13 nanoseconds. My CPU has a max turbo frequency of 4 GHz, so if we assume one instruction per cycle and that the additions are merged with some of the multiplications as fused multiply-adds, this gives us an estimated runtime of `9/4e9 = 2.25` nanoseconds. I’m not 100% sure whether this analysis is indeed sane, but it gives me a warm fuzzy feeling that this number aligns so closely with the empirically observed runtime of 2.13 nanoseconds.

It’s probably more complicated than that. While those operations should all have 1/cycle throughput, they will often have around 4 cycles of latency. This is complicated by the fact that when you are benchmarking, the CPU is probably interleaving multiple `inv`s (i.e. it will do the first step for 4 of them, then the next step, etc.).
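The latency side of this can be made visible by benchmarking a serially dependent chain, where each step must wait for the previous result. A sketch, with a hypothetical `chained_inv` helper (per-call time should then reflect latency rather than throughput):

``````julia
# Hypothetical sketch: n dependent Newton steps in a row. The data
# dependence (each x feeds the next step) prevents the CPU from
# overlapping iterations, so this measures latency, not throughput.
function chained_inv(y, n)
    x = 24/17 - (8/17) * y
    for _ = 1:n
        x = x * (2 - y * x)   # serially dependent chain
    end
    return x
end

# e.g. using BenchmarkTools; @btime chained_inv($(Ref(1.5))[], 1000),
# then divide the reported time by 1000 to get the per-step latency.
``````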

3 Likes