I was playing around with Newton’s method the other day and was mind-blown when I realised that a fairly straightforward Julia implementation of Newton’s method applied to f(x) = 1/x - y comes within roughly a factor of 1.5 of inv(y)!

"""
newton_inv(y)
Compute `inv(y)` for `1 <= y <= 2` using Newton's method.
"""
function newton_inv(y)
@fastmath x = 24/17 - (8/17) * y
for k = 1:4
@fastmath x = x * (2 - y*x)
end
return x
end
using BenchmarkTools
using Printf
function benchmark()
    @printf("   inv runtime: %6.2f nanoseconds\n", @belapsed(inv($(Ref(1.0))[])) * 1e9)
    @printf("Newton runtime: %6.2f nanoseconds\n", @belapsed(newton_inv($(Ref(1.0))[])) * 1e9)
end
julia> BenchmarkTools.DEFAULT_PARAMETERS.samples = 1_000_000;
julia> benchmark() # Laptop plugged in
inv runtime: 1.52 nanoseconds
Newton runtime: 2.13 nanoseconds
julia> benchmark() # Laptop on battery
inv runtime: 2.53 nanoseconds
Newton runtime: 3.37 nanoseconds

Furthermore, newton_inv() appears to produce almost correctly rounded answers:

function accuracy()
    y = 1.0 .+ rand(100_000)
    x = newton_inv.(y)
    println("Errors: ", unique(abs.(x .- inv.(y))))
end
julia> accuracy()
Errors: [0.0, 1.1102230246251565e-16]
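As an aside, the restriction to 1 <= y <= 2 is straightforward to lift by peeling off the exponent, since scaling by powers of two is exact in binary floating point. Here is a sketch (newton_inv_any is a name I made up; positive finite y only):

```julia
# Sketch: extend newton_inv to any positive finite Float64 via frexp/ldexp.
# Exponent scaling is exact, so the accuracy of newton_inv is unchanged.

function newton_inv(y)  # repeated from above so this snippet runs standalone
    @fastmath x = 24/17 - (8/17) * y
    for k = 1:4
        @fastmath x = x * (2 - y*x)
    end
    return x
end

function newton_inv_any(y)   # hypothetical name; assumes 0 < y < Inf
    f, e = frexp(y)          # y == f * 2^e with f in [0.5, 1)
    x = newton_inv(2f)       # 2f lies in [1, 2), where newton_inv is valid
    return ldexp(x, 1 - e)   # 1/y == (1/(2f)) * 2^(1-e)
end
```

For example, newton_inv_any(10.0) should agree with 0.1 to within a couple of ulps.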

Some details regarding the algorithm are described on the Wikipedia page for division algorithms.
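For reference, the loop body is exactly the Newton update for f(x) = 1/x - y, and the first line is the minimax line 48/17 - (32/17)D for D in [1/2, 1] from that page, rescaled to [1, 2] via D = y/2:

```latex
x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}
        = x_k - \frac{1/x_k - y}{-1/x_k^2}
        = x_k \, (2 - y \, x_k),
\qquad
\frac{1}{y} = \frac{1}{2} \cdot \frac{1}{y/2}
\approx \frac{1}{2} \left( \frac{48}{17} - \frac{32}{17} \cdot \frac{y}{2} \right)
= \frac{24}{17} - \frac{8}{17} \, y .
```

Since the relative error roughly squares with every step, the initial error of at most 1/17 shrinks to about (1/17)^16 ≈ 2e-20 after four steps, i.e. below Float64 resolution.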

As far as I know, this is how a lot of inverse functions are computed; see e.g. base/special/cbrt.jl. For other special functions, it looks more like argument reduction plus a polynomial fit.

Note that when you’re testing the error, you really should use BigFloat. Otherwise, you could see errors that are noticeably higher if the Base version makes different rounding decisions.

The first line of newton_inv(y) indeed just evaluates a linear polynomial whose coefficients are computed at compile time. I have updated my code to make this clearer.

I have also increased the number of benchmarking samples to make the runtime estimate more accurate. Now I am getting exactly the same runtimes every time I run benchmark(). Further, I have noticed that the runtime changes depending on whether my laptop is plugged in or on battery power. This is probably not surprising, but it is still fascinating to see the effect so clearly.

inv(y) should be governed by IEEE-754 and return correctly rounded results; hence comparing newton_inv(y) against inv(y) shows to what extent newton_inv(y) implements correct rounding.
Here’s the error if I do compute the reference result using BigFloat:

function accuracy()
    y = 1.0 .+ rand(100_000)
    x = newton_inv.(y)
    println("Errors: ", maximum(abs.(x .- inv.(big.(y)))))
end
julia> accuracy()
Errors: 1.62...e-16
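To put that number in perspective, it helps to express the error in ulps rather than in absolute terms: the results lie in (1/2, 1], where one ulp is eps(x) ≈ 1.1e-16, so 1.62e-16 is on the order of 1–2 ulps. A quick sketch (ulp_error is a name I made up):

```julia
# Sketch: worst-case error of newton_inv in units of the last place (ulp),
# measured against a BigFloat reference rounded back to Float64.
# eps(x) is the spacing of Float64 numbers at x, i.e. one ulp.

function newton_inv(y)  # repeated from above so this snippet runs standalone
    @fastmath x = 24/17 - (8/17) * y
    for k = 1:4
        @fastmath x = x * (2 - y*x)
    end
    return x
end

function ulp_error(n = 100_000)  # hypothetical helper
    y = 1.0 .+ rand(n)
    x = newton_inv.(y)
    maximum(abs.(x .- Float64.(inv.(big.(y)))) ./ eps.(x))
end
```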

A few more things which I find quite mind-blowing about this example:

newton_inv(y) competes against what is essentially a single machine instruction:
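You can check this yourself (the exact listing depends on CPU and Julia version; on x86-64 the body should boil down to a single divide instruction, vdivsd or divsd):

```julia
using InteractiveUtils  # exports @code_native outside the REPL

# Print the native code generated for inv(::Float64). Expect essentially a
# single floating-point divide instruction, plus function prologue/epilogue.
@code_native debuginfo=:none inv(1.0)
```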

It seems pretty insane that the simple code in newton_inv() can almost compete with what is hardwired into my CPU. For comparison, the cbrt code mentioned by @antoine-levitt first does some bit-fiddling magic, then evaluates some complicated polynomial and finally does two Newton steps to achieve full Float64 accuracy.

newton_inv(y) performs 9 multiplications and 5 additions, all of which have to be executed serially, in just 2.13 nanoseconds. My CPU has a max turbo frequency of 4GHz, so if we assume one instruction per cycle and that the additions are merged with some of the multiplications as fused multiply-adds, this gives us an estimated runtime of 9/4e9 seconds = 2.25 nanoseconds. I’m not 100% sure whether this analysis is sound, but it gives me a warm fuzzy feeling that it aligns so closely with the empirically observed runtime of 2.13 nanoseconds.

It’s probably more complicated than that. While those operations should all have a throughput of one per cycle, they will often have a latency of around 4 cycles. This is complicated by the fact that when you are benchmarking, Julia is probably interleaving multiple invocations (i.e. it will do the first step for 4 of them, then the next step, etc.).