# Efficient implementations of Fibonacci function with interesting results

#1

It is well known that the standard recursive implementation of the Fibonacci function defined by

F(0) = 0
F(1) = 1
F(n) = F(n-1) + F(n-2) for n > 1


runs in exponential time. Therefore most often the Fibonacci function is implemented either iteratively

function itfib(n)
x,y = BigInt(0), BigInt(1)
for i in 1:n
x,y = y, x+y
end
return x
end


or by matrix exponentation

const b = [ BigInt(1) 1; 1 0]

function mfib(n)
if n == 0
return 0
elseif n == 1
return 1
else
return (b^(n+1))[2,2]
end
end


The underlying algorithms run in linear time. Using a divide-and-conquer strategy for matrix exponentation we get an algorithm running in time O(log n):

function pow(a,n)   # n > 0 !
if n == 1
return a
else
c = pow(a, n >> 1)
if iseven(n)
return c*c
else
return c*c * a
end
end
end

function afib(n)
if n == 0
return 0
elseif n == 1
return 1
else
return (pow(b,n+1))[2,2]
end
end


Another implementation running in time O(log n) uses for n>1Dijkstra’s recurrence equation

F(2n)  = F(n) ( 2F(n-1) + F(n) )

F(2n+1) = F(n)^2 + F(n+1)^2


and memoization:

const fibmem = Dict{Int,BigInt}()
function fib(n)
get!(fibmem, n) do
if n <= 0
return BigInt(0)
elseif n == 1
return BigInt(1)
else
m = n >> 1
if iseven(n)
return fib(m)*(2*fib(m-1) + fib(m))
else
return fib(m+1)^2 + fib(m)^2
end
end
end
end


I tested the iterative version itfib for 10^7 and all the other version for 10^9 and got the following results

@elapsed(fib(10^9)) = 12.006108423

@elapsed(afib(10^9)) = 52.248328254

@elapsed(mfib(10^9)) = 78.922591838

@elapsed(itfib(10^7)) = 882.27293546


I was a little bit surprised that Dijkstra’s recurrence relation together with memoization was so much faster than the efficient matrix exponentation method (afib). I was also surprised that the standard matrix exponenation method (mfib) was only a “bit slower” than afib, but much faster than the iterative implementation. What are the reasons for these facts?

#2

The fastest method to compute these is the Binet formula F_n = \frac1{\sqrt5}(\Phi^n-\Phi^{-n}), where \Phi = \frac{1+\sqrt5}2.

julia> F(n) = ((1+sqrt(big(5)))^n-(1-sqrt(big(5)))^n)/(sqrt(big(5))*big(2)^n)
F (generic function with 1 method)

julia> @btime F(10^7);
167.953 μs (25 allocations: 1.19 MiB)


It’s slightly slower if you convert it to BigInt afterwards.

#3

Well. You can express Fibonacci sequence as powering a matrix, but you can eigen-decompose the matrix into A^{-1}DA where D is diagonal, so powering that is dead easy! that’s one way to derive the Binet formula above.

#4

The Binet formula is useless if you want an exact integer result for large n.

#5

The fastest method in Julia is probably to call the optimized C implementation from the underlying libgmp library:

function fastfib(n)
z = BigInt()
ccall((:__gmpz_fib_ui, :libgmp), Cvoid, (Ref{BigInt}, Culong), z, n)
return z
end


This is about 50% faster than your memo-ized fib implementation above (which means your fib function is quite good)!

#6

Actually, the formula produces an EXACT integer, if you have trouble getting one, it is due to the limitations of your floating point arithmetic.

#7

I was under the impression that this was @stevengj’s point, as all floating point arithmetic has limitations of some kind, starting from the fact that the golden ratio is irrational, and it only gets worse from there.

#8

Right. You need rapidly increasing precision with n in order to guarantee a correct rounded result. Of course, this is possible to do with BigFloat via an appropriate call to setprecision, but that probably negates any performance advantage of this approach. Just using the default BigFloat precision of 256 bits (about 77 decimal digits), as @chakravala did above for n=10^7, gives the wrong results.

For example, I tried repeatedly doubling the BigFloat precision until the Binet formula gave an exact result (when rounded to BigInt) for n=10^7, and it required a precision of 8388608 bits. At this precision, the Binet F(n) function posted by @chakravala takes 3.2 seconds on my machine, whereas fastfib(n) takes 0.05 seconds.

For n=10^9 as in the original post (for which fastfib takes about 8 seconds), I tried repeatedly doubling the precision to get an exact answer from F(n), but I ran out of patience after waiting a few minutes.

#9

It’s good that calling out to C is so easy, but part of Julia’s elevator pitch is that it solves the two language problem in data science. Can that 50% difference be reduced much more by a pure Julia function?

#10

@stevengj the performance is bad because you are using naive implementations. It would probably be much quicker using arb with Nemo package and using the Chakravala algorithm to approximate the square root. The chakravala algorithm combined with the bhavana principle will significantly outperform the rate of convergence of regular continued fractions, and it has not even been tried to apply it to the square root of 5 here to obtain the accuracy of the roots to a higher precision in shorter time. Perhaps, I will give it a try when I have some free time, I will try to see if the chakravala algorithm and bhavana principle can be used to converge quicker than these by using them to compute square root of 5 for Binet’s formula, also using the arbitrary precision facilities of Nemo.

#11

Probably, but with some caveats. First, of course, you’d want to look at the GMP implementation to see if they are using any high-level algorithmic tricks that can easily be copied.

However, realize that the BigInt type in Julia is based on an external C library (GMP), not pure Julia code, so to get the maximum possible optimizations you may need to call low-level GMP functions. In particular low-level GMP code can eke out some additional performance gains by careful use of in-place operations on pre-allocated BigInt objects (for example, by re-using the same object for more than one calculation).

Automating those kind of gains seems like it will require lower-level support for memory management of bignum objects in the Julia language implementation, which has not been a high priority of most of the developers so far. See, for example, #4176 and #11202.

(If you compare the code length and clarity for the GMP C code and the Julia fib(n) implementation, in my opinion a 50% performance penalty is quite good for such short and readable code. In general, getting the “last factor of 2” for a highly-optimized implementation of a nontrivial algorithm often takes an unreasonable amount of code in any language, and is rarely worth it except for critical library functions…you’re usually better off focusing on extensibility, generality, and clarity.)

#12

The sqrt function for BigFloat is calling mpfr_sqrt from the MPFR library; if MPFR is using a “naive” algorithm, you should file an issue with MPFR, but I’m skeptical of that assertion without evidence.

#13

I agree, but I’m curious about whether the same “unreasonable” code can be written in Julia as it can in C or C++. Could BigInt have been written in pure Julia, given the manpower and time, or will two languages always be required to achieve the highest performance?

#14

Yes, probably, but it would require a huge amount of effort to replicate GMP.

One tricky thing to replicate in “pure” Julia is code that uses inline assembly to access specialized CPU instructions, e.g. the crc32c function in Julia’s CRC32c standard library calls a highly optimized C implementation that utilizes inline assembly to access the crc32c instruction of modern x86 processors. It is possible to access this in Julia via the Base.llvmcall “escape hatch” that lets you insert LLVM instructions directly, but there was not much point in rewriting the C in this case.

Another difficulty is to replicate code that uses low-level thread management, since you can’t call these directly from Julia without worrying about the thread-safety of the Julia runtime. However, the native Julia multithreading support has been getting better by leaps and bounds, so there’s not much need for lower level access anymore.

In general, excluding libraries like GMP or OpenBLAS or FFTW that involve tens of thousands of lines of low-level code and require a huge engineering effort to replicate, I can’t recall any case where careful tuning of Julia code was not able to replicate the performance of C or Fortran within a few percent. I’ve sometimes found that one can even exceed the speed of optimized C or Fortran, because Julia features like metaprogramming allow us to code in a style that would be difficult to replicate in a lower-level language.

#15

I wrote a bit about calling intrinsics in http://kristofferc.github.io/post/intrinsics/ (under the header “Using intrinsics”). It can be done slightly simpler than in the linked version:

julia> crc32(crc::Int32, x::Int32) = ccall("llvm.x86.sse42.crc32.32.32", llvmcall, Int32, (Int32, Int32), crc, x);

julia> crc32(Int32(2), Int32(3))
-582636872


#16

Julia already uses the efficient power-by-squaring method for matrix exponentiation. You can use the Base implementation to compute integer powers of any type that implements * by calling Base.power_by_squaring.

#17

Most of the time for the Binet formula calculation actually goes into the exponentiation:

julia> using Nemo; ff = RealField(8388608);

julia> @time five = sqrt(ff(5));
0.129010 seconds (1.77 k allocations: 41.310 MiB, 0.66% gc time)

julia> @time (1+five)^(10^7);
1.668563 seconds (109 allocations: 289.760 MiB, 3.17% gc time)


So that is where the problem is with the Binet formula, the problem is actually not with the square root.

#18

The BigFloat^Integer operation is also done by calling an optimized MPFR routine. In short, I don’t think there is any problem with the implementation being especially naive here—although you could save a few operations by being more careful, there’s no way to rewrite the Binet calculations in floating-point arithmetic to save orders of magnitude in time. The problem, as I said, is that the Binet formula requires such high precision that it is effectively useless (compared to BigInt methods) if you want an exact result. (It is very nice if you just want an approximate answer, however!)

Another way to see a problem with the Binet formula is to look at the complexity. Exponentiation via power-by-squaring requires O(log n) multiplications, so at first glance it may seem comparable to the O(log n) BigInt calculations for matrix exponentiation or for Dijkstra’s recurrence with memoization. This is deceptive, however, because not all bignum multiplications have the same cost. For the BigInt calculations, the numbers start small (hence cheap) and get bigger (more expensive) towards the end. For Binet’s formula via BigFloat arithmetic, however, you have to use the full precision from the very beginning (to get the square root accurately). If you work it out, I think this leads to roughly an extra log(n) factor in the complexity.

#20

Here’s one more variation that’s somewhere in between Matrix exponentiation and the memoized Dijkstra recurrence.

Using the recurrence

F(n*m) = F(m+1)*F(n+1) - F(n-1)*F(m-1)
F(n*m-1) = F(n)*F(m) + F(n-1)*F(m-1)


we can implement a multiplication on Fibonacci triples:

import Base.*

struct FibTrip{T}
a::T
b::T
c::T
end

function *(x::FibTrip, y::FibTrip)
q = x.c*y.c
b = x.a*y.a - q
c = x.b*y.b + q
FibTrip(b+c, b, c)
end


Then using your pow function for power by squaring,

function pow(a,n)   # n > 0 !
if n == 1
return a
else
c = pow(a, n >> 1)
if iseven(n)
return c*c
else
return c*c * a
end
end
end


we can get any Fibonacci number by exponentiating these triples:

function fibt(n)
n < 0 && return iseven(n) ? -fibt(-n) : fibt(-n)
n == 0 && return BigInt(0)
n == 1 && return BigInt(1)
return pow(FibTrip(BigInt(1), BigInt(1), BigInt(0)), n-1).a
end

> @elapsed fibt(10^9)
13.762949728


I’m finding that this is 30-40% slower than the memoized algorithm, and a factor of 3-5 times faster than the matrix exponentiation algorithms.

One thing that surprised me a lot is that using Base.power_by_squaring in place of pow slows this algorithm down by a factor of 2. I spent a little bit of time trying to figure out why this might be and came up with nothing. Maybe the multiplications are accumulated in a slightly different order?

One thing that I think could improve this algorithm is if Julia’s BigInts had operations that could write to a pre-allocated destination instead of always allocating new memory to store the result. Similar to mul! for matrices.