Poor BigInt performance

Hi everybody,

I wanted to quickly benchmark the BigInt type. I wrote a simple function (with double recursion, just because I wanted to consume time :wink:). I then tested the same function in Python, and found that Python 3.12 is 10 times faster on this example.

Here is my Julia 1.9.4 code:

nfib(n) = n > 2 ? (nfib(n-2)+nfib(n-1))*n : n

println("--- First run ---")
@time x = nfib(big(40))
println(x)

println("--- Second run ---")
@time x = nfib(big(40))
println(x)

Here is my Python 3.12 code:

import time

def nfib(n):
    return (nfib(n-2)+nfib(n-1))*n if n > 2 else n

print(" --- First run ---")
t = time.perf_counter()
x = nfib(40)
print("Time=", time.perf_counter()-t, "seconds")
print(x)

print(" --- Second run ---")
t = time.perf_counter()
x = nfib(40)
print("Time=", time.perf_counter()-t, "seconds")
print(x)

Julia 1.9.4 result:
--- First run ---
97.582200 seconds (1.74 G allocations: 33.548 GiB, 27.86% gc time)
12306496796223503426182275975073985352177589230680
--- Second run ---
97.781208 seconds (1.74 G allocations: 33.548 GiB, 28.93% gc time)
12306496796223503426182275975073985352177589230680

Python 3.12 result:
--- First run ---
Time= 8.463558899995405 seconds
12306496796223503426182275975073985352177589230680
--- Second run ---
Time= 8.873872700001812 seconds
12306496796223503426182275975073985352177589230680

I did not expect such a difference!
I made some progress by adding two constants in my code:

const bigone = big(1)
const bigtwo = big(2)
nfib(n) = n > bigtwo ? (nfib(n-bigtwo)+nfib(n-bigone))*n : n

With the following result:
--- First run ---
57.288087 seconds (921.01 M allocations: 18.299 GiB, 23.61% gc time)
12306496796223503426182275975073985352177589230680
--- Second run ---
57.313774 seconds (921.01 M allocations: 18.299 GiB, 24.87% gc time)
12306496796223503426182275975073985352177589230680

But it is still 5 times slower than the same Python code.
Do you have other ideas for improvement?

I would do two things:

  1. Since n is small, represent it using an Int, and only use BigInt for the result of the function.
  2. Switch over to Int computations for sufficiently small arguments, e.g. for n ≤ 10 it will be fine for Int even on 32-bit machines.

(Obviously, for this particular function you can do much better by using memoization or other algorithmic changes, but I’m trying to keep to basically the same algorithm.)

nfib(n) = n > 2 ? (nfib(n-2)+nfib(n-1))*n : n # uses precision of n
nfibb(n) = n > 10 ? (nfibb(n-2)+nfibb(n-1))*n : big(nfib(n)) # returns BigInt

On my machine, nfibb(40) gives the same result as nfib(big(40)) but is about 60x faster.
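As an aside, the memoization mentioned above is easy to sketch in Python with `functools.lru_cache` (the cache remembers every distinct argument, so the double recursion collapses to a linear number of calls):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def nfib(n):
    # same recurrence as in the thread; each n is computed only once
    return (nfib(n - 2) + nfib(n - 1)) * n if n > 2 else n

print(nfib(40))  # 12306496796223503426182275975073985352177589230680
```

With the cache, `nfib(40)` finishes essentially instantly, which is why the replies above treat this benchmark purely as an allocation stress test rather than a sensible way to compute the value.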

1 Like

The BigInt type in Julia is more or less a thin wrapper around GMP; regular operations like + are written such that the result is a new, independent object that needs to be allocated, instead of mutating one of the existing arguments. I admittedly don’t know how Python handles its default integer type internally, but I’d be willing to bet that they have shortcuts and fast paths for small numbers, which BigInt doesn’t have.

1 Like

Python here is faster because it has something that I wish Julia had: a BigInt type which is reasonably fast for small values because, under the covers, it uses a machine integer for smaller bigints. A bigint is in general a pointer to some memory holding the representation. The trick is that bigints holding a small integer are represented by the pointer word itself, with some bits set so that one can tell it is not really a pointer.

I wish someone would write a similar type for Julia. It is above my own competence.

4 Likes

Yup, they have optimizations for small numbers.

Probably more importantly, Python’s reference-counting memory management is better optimized for the case of allocating/deallocating zillions of objects, compared to Julia’s generational gc; the basic tradeoff here is that reference counting incurs a big overhead in performance-critical code that is not doing zillions of allocations, i.e. the type of code that Julia is designed to favor.

(You can improve the allocation situation by using MutableArithmetics.jl, at the price of more awkward syntax.)

6 Likes

The difficulty is all about the memory management, as I understand it. See also: BigInt and BigFloat performance is poor · Issue #4176 · JuliaLang/julia · GitHub (and links therein).

5 Likes

I have tested the hypothesis that Python's better performance on this example comes from its optimisation for "small" (less than 128 bits long) integers.
The result rather confirms that Python actually has this optimisation, but also shows it is not the main reason for the gap.

Here is how I modified the Julia code:

const n0 = big(2)^128
nfib(n) = n > 2 ? (nfib(n-2)+nfib(n-1))*n : n + n0

And the Python code:

n0 = 2**128
def nfib(n):
    return (nfib(n-2)+nfib(n-1))*n if n > 2 else n + n0

With the initial value n0, each recursive step will handle numbers greater than 2^128.

And Python is still a little less than 10 times faster than Julia on this example: a slightly reduced ratio (small-integer optimisation?), but it keeps the main part of its advantage.

Julia result:
125.004145 seconds (1.94 G allocations: 42.698 GiB, 26.73% gc time, 0.01% compilation time)
2683715248941226978387628516157198681887061918991416666520519942874482203714622794570840

Python result:
Time= 14.199815100000706 seconds
2683715248941226978387628516157198681887061918991416666520519942874482203714622794570840

2 Likes

That’s rather odd. For me nfib(40) only takes 30 seconds compared to 17 for python3. What OS, CPU, and julia version are you using?

1 Like

Windows 11 on Intel 11th generation Core i7-11850H CPU, Python 3.12, Julia 1.9.4.
32 GB ram.
64 bits OS of course.

1 Like

How did you install Julia?

1 Like

I have used the Windows installer.

1 Like

Here’s a comparison with MutableArithmetics.jl:

import MutableArithmetics; const MA = MutableArithmetics;

nfib(n) = n > 2 ? n*(nfib(n - 2) + nfib(n - 1)) : big(n)

nfib_ma(n) = if n > 2
  t = nfib_ma(n - 1)
  s = nfib_ma(n - 2)

  # Equivalent to `t = t+s`, but with less allocation.
  t = MA.operate!!(+, t, s)

  # Ideally this would be `MA.operate!!(*, t, n)`, however a relevant
  # method is missing in MA, causing extra allocation if that is used.
  t*n
else
  big(n)
end

MA also has a @rewrite macro which is supposed to be able to rewrite the usual arithmetic expressions into MA form, but in this case it doesn't work well. There's a lot of "low-hanging fruit" performance improvement possibilities in MA.

Timings:

julia> include("/tmp/ma_fib.jl");

julia> @time nfib_ma(40)
 26.268789 seconds (550.77 M allocations: 8.678 GiB, 29.08% gc time)
12306496796223503426182275975073985352177589230680

julia> @time nfib(40)
 36.799735 seconds (716.34 M allocations: 12.962 GiB, 31.23% gc time)
12306496796223503426182275975073985352177589230680
2 Likes

Unfortunately, the recursive nature of this algorithm still requires a huge number of allocations, even with mutable arithmetic; I can’t see a way to cut this down to a small number without fundamentally restructuring the algorithm (which, of course, you would want to do in this particular case if you actually cared about it).

1 Like

For what it’s worth, Nemo.jl is a fair bit faster for me out of the box:

julia> using Nemo

julia> nfib(n) = n > 2 ? (nfib(n-2)+nfib(n-1))*n : n
nfib (generic function with 1 method)

julia> ZZ = FlintZZ
Integer ring

julia> @time x = nfib(big(40))
 28.624818 seconds (1.13 G allocations: 18.299 GiB, 24.48% gc time)
12306496796223503426182275975073985352177589230680

julia> @time x = nfib(ZZ(40))
 21.542211 seconds (409.35 M allocations: 6.100 GiB, 31.30% gc time, 0.08% compilation time)
12306496796223503426182275975073985352177589230680
1 Like

I know how to transform the algorithm into an iterative one. By the way, I did it and nfib(40) then takes 0.2 seconds.
But I did the double recursion on purpose, just to have a lot of calculations.
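For reference, such an iterative version can be sketched in Python (the Julia version is analogous): the same recurrence computed bottom-up, so each value is produced exactly once instead of exponentially many times.

```python
def nfib_iter(n):
    # recurrence from the thread: nfib(n) = (nfib(n-2) + nfib(n-1)) * n
    # for n > 2, and nfib(n) = n otherwise
    if n <= 2:
        return n
    a, b = 1, 2  # nfib(1), nfib(2)
    for k in range(3, n + 1):
        a, b = b, (a + b) * k
    return b

print(nfib_iter(40))  # 12306496796223503426182275975073985352177589230680
```

This turns roughly 1.7 billion recursive calls into 38 loop iterations, which is why the timing drops from minutes to a fraction of a second.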

I have another one of those gut instincts that I can’t verify right now - can Python reuse its allocations more efficiently due to its reference counting?

1 Like

Yes.

There have been attempts to do this in Julia for bignum arithmetic, e.g. julia#10084, but so far these experiments haven’t been big wins, though closer integration with the GC has been proposed (julia#11202).

3 Likes

yeah not having to use finalizers to free bigints would be very nice

It’s pretty incredible how you get an almost 2x change from 97s/1.74G allocs to 57s/921M allocs just by replacing the 1 and 2 literals with the big equivalents so they don’t have to be converted and allocated repeatedly.

I don’t think this has been mentioned yet, but Python interns ints from -5 to 256, so those values never have to be allocated or deallocated, you just point to them. This is on top of all those other optimizations Python does to make int viable, though numerical computing still goes for fixed width integers. As for the naive recursive Fibonacci computation, I’m not sure how much of the allocations interning up to 256 would cut; it covers 13 Fibonacci numbers between 0 and 233, so it would depend on how often those are repeatedly allocated.
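The interning is easy to observe from Python itself. Note this is a CPython implementation detail, not a language guarantee, and the values have to be constructed at runtime to stop the compiler from folding both sides into a single constant:

```python
a = 256
b = int("256")   # built at runtime, yet CPython hands back the interned object
print(a is b)    # True: values in [-5, 256] are preallocated and shared

c = 257
d = int("257")   # outside the interned range: a fresh allocation
print(c is d)    # False
```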

9 Likes

I agree, that is the elephant in the room!