Performance comparison: Julia is slower than Mojo, am I doing something wrong?

As expected for tail-recursive calls, FibonacciTailRecursive adds adjacent numbers as the stack frames pile up, and no arithmetic happens when the stack unwinds. It’s the loop while !(n == 1); n, previous, current = n-1, current, previous+current; end, but with each iteration’s state needlessly kept in its own stack frame.
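For reference, here is that loop written out as a standalone function (a minimal sketch; the name fib_iter is mine, not from the thread, and it assumes n ≥ 1):

# Iterative equivalent of the tail-recursive version: the same additions,
# but no stack frames piling up.
function fib_iter(n::Int)
    previous, current = 0, 1
    while !(n == 1)
        n, previous, current = n - 1, current, previous + current
    end
    return current
end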
fib, on the other hand, forms a binary tree of recursive calls, deferring the additions to when the stack unwinds, and most of those calls are redundant: e.g. fib(10) calls fib(9) and fib(8), and fib(9) calls fib(8) and fib(7), so fib(8) is computed twice. Memoization can help by caching results instead of recomputing redundant calls.
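A minimal memoization sketch (the Dict-based cache and the name fib_memo are my own illustration, not code from the thread):

# Cache each result so fib_memo(k) is computed only once, turning the
# O(2^n) call tree into O(n) work.
const FIB_CACHE = Dict{Int,Int}(0 => 0, 1 => 1)

function fib_memo(n::Int)
    get!(FIB_CACHE, n) do
        fib_memo(n - 1) + fib_memo(n - 2)
    end
end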

Mojo does support TCO.
Even the naive call can be optimized, reducing the number of calls from 2 to 1, e.g. look at this C++ assembly:

Mojo does support TCO. I recall reading a comment saying that their RAII implementation actually had TCO in mind, arranging things so deallocations can happen before the tail call. This is in contrast to C++, where RAII-inserted destructors end up at the end of scope, i.e. in the tail positions, preventing the optimization.

Therefore, I’d assume the Mojo code was optimized like the C++ above.

FWIW, I get

julia> @time fibcpp(50)
 47.515274 seconds
12586269025

julia> @time fib(50)
 57.792934 seconds
12586269025
4 Likes

How much is this overhead? Is it larger than the overhead of a virtual function call in C++?

P.S. Is StaticCompiler.jl free from this overhead?

No, it’s smaller than a virtual call; the overhead is just an atomic load. And yes, StaticCompiler.jl does not have this overhead.

2 Likes

Dumb question: is the Fibonacci sequence used for anything?

2 Likes

Of course, its primary application is as an ambiguous micro-benchmark for comparing programming language execution! :joy:

25 Likes

It pops up in some places: wiki

yes, but any Fibonacci-like sequence is really dumb to compute using recursion.

FWIW, this is my preferred Julia implementation

julia> fib(::Val{N}) where {N} = N <= 1 ? N : fib(Val(N-2)) + fib(Val(N-1))
fib (generic function with 1 method)

julia> fib(N::Int) = fib(Val(N))
fib (generic function with 2 methods)

julia> @time fib(50)
  0.000000 seconds
12586269025
23 Likes

very nice. but I had in mind something more like

# n is a Fibonacci number iff 5n^2 + 4 or 5n^2 - 4 is a perfect square (integer n >= 0)
issquare(n) = let s = isqrt(n); isequal(s^2, n); end
isfib(n) = issquare(5n^2 + 4) || issquare(5n^2 - 4)

# count upward, stopping once n values have passed the isfib test
function fib(n)
    v = 0
    i = 0
    while i < n
        i += isfib(v)
        v += 1
    end
    return v - 1
end
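A quick sanity check of that counting version (my own usage sketch, assuming only the fib above is in scope):

# The counting version enumerates 0, 1, 2, 3, 5, 8, 13, 21, ...
# (0 is counted first, and the repeated 1 is counted only once).
fib.(1:8)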

I remember several almost identical, but reversed, discussions about this, comparing Python/NumPy and Julia.

The Python side of the discussion argued that it was dumb to use recursion instead of vectorization to compute Fibonacci. The reasoning in favor of this benchmark was (and imo, remains) that it’s not a Fibonacci benchmark, but a recursion benchmark.

3 Likes

“If you look bad in a benchmark, attack the benchmark; if you look good, attack the other participants” is always the pattern here :joy: Well, attack is a strong word, but it is regularly the case that one side feels attacked by the other side’s use of a “bad” benchmark to underpin their superiority. Real-world code usually has so many different aspects that the differences get much muddier. In this case you switch to: if you look bad in one aspect, focus on another one :upside_down_face:

4 Likes

For example, the Fibonacci benchmarks all use the same (inefficient) doubly-recursive algorithm, and the pi summation benchmarks use the same for-loop.

4 Likes

if you look at the post @Oscar_Smith was responding to, I think it’s clear that they’re not discussing the benchmark. One person asked if the Fibonacci sequence is used for anything useful, then another person said that it pops up in some places, and then Oscar agreed that it does pop up, but said it would be really stupid to calculate it using this recursive algorithm (which is true).

Oscar didn’t say anything about it being a bad benchmark, just that it’s a bad algorithm.

4 Likes

This code doesn’t give the right answer - https://discourse.julialang.org/t/simple-recursive-fibonaci-example-how-to-make-it-faster/32369/15

fibval(50) gives 20365011074
fib(50) gives 12586269025
Fibonacci[50] gives 12586269025 # Mathematica code.

What gives?

gani –

It returns 1 instead of N when N <= 1, so the whole sequence is shifted by one place: fibval(50) is actually F(51) = 20365011074 rather than F(50) = 12586269025.
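A hedged sketch of the fix (I haven’t looked closely at the linked code, so its exact shape is assumed): make the base case return N instead of 1, as in the Val version earlier in this thread:

# Assumed shape of the linked fibval; returning N (not 1) for N <= 1
# restores fib(0) = 0 and fib(1) = 1, so fibval(50) == 12586269025.
fibval(::Val{N}) where {N} = N <= 1 ? N : fibval(Val(N - 2)) + fibval(Val(N - 1))
fibval(N::Int) = fibval(Val(N))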

1 Like

That’s pretty cool! In the performance guidelines, if I recall correctly, they say that over-templating on values can lead to detrimental performance, because you need to compile every time a new N is used as an input argument. How exactly does this code avoid this issue? Is there a lot of stuff happening behind the scenes on those first two lines? Does Julia compile this for a few candidate values before you call fib(N)?

This reminds me a lot of compile-time constexpr computation in C++ using value-type template parameters, which I use (and abuse) a lot in some projects.

I haven’t looked closely, but since it’s recursive, it’s going to compile fib(Val(1)) … fib(Val(N)), so anything less than N doesn’t need to be recompiled, and a call with a larger argument will benefit from the already-compiled calls.
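A quick way to see this (a sketch; exact timings will vary, and the first @time may include compilation):

@time fib(40)   # first call: may compile fib(Val(0)) ... fib(Val(40))
@time fib(40)   # already compiled, essentially instant
@time fib(45)   # only fib(Val(41)) ... fib(Val(45)) are new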

2 Likes

Yes. Note that it is also vastly faster to compile than it is to actually call a recursive version without Vals, because the compilation memoizes.

That is, despite looking recursive, compilation time is O(N).
Each of those N methods getting compiled are also extremely simple, so compilation feels instant.

Much better than O(2^N) runtime of a regular recursive implementation.

Anyway, in this case it works well.
But both Julia and C++ can suffer from extreme compile times and code bloat from excessive monomorphization.
Basically no code bloat in this case, because it compiles away.

So I agree with the performance tips to avoid Val(N) in general. But there are cases like this where it is very powerful!

4 Likes

EDIT: See my next post for an improvement in timing, already achieved, and some speculation.

Yes and no (basically no, for speed): you CAN have TCO, it’s just not the default. That’s a good choice; it was discussed and decided on the JuliaLang GitHub that it shouldn’t be the default, same as in e.g. Scala, since it’s bad for debugging and not really faster in practice for most real-world algorithms. So if Mojo actually defaults to TCO (I think it does), that may be a mistake.

julia> import Pkg; Pkg.add(url="https://github.com/TakekazuKATO/TailRec.jl")

The tail-recursive fib(40) below (without actually using that package) is 323546/8.404 = 38499 times faster, increasing to 915720 times faster for fib(50) (for @time; with @btime it seemed only 39893/10.925 = 3651 times faster, so I guess I miscalculated). That’s despite allocations, which increase for higher n, while the original has none. So don’t knock allocations in all cases…

fib(50) is of course slower than fib(40), but it’s a bit strange how the max is reliably an order of magnitude higher, likely because of “3 evaluations” vs 1, so should be disregarded:

julia> @benchmark fib(40)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.457 μs … 545.433 μs  ┊ GC (min … max): 0.00% … 96.32%
 Time  (median):     9.478 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.790 μs ±   5.476 μs  ┊ GC (mean ± σ):  0.54% ±  0.96%

  ▂ ▁▅▁▆▅▃█▇▅▅▆▅▃▁▁▂▁                                         ▂
  █▅█████████████████▇▅▇▇▆▆▇█▇▅▆▆▆▄▄▄▆▄▆▂▂▅▄▄▄▅▄▅▆▆▆▆▇▆▇██▆▆▆ █
  8.46 μs      Histogram: log(frequency) by time      15.2 μs <

 Memory estimate: 1.42 KiB, allocs estimate: 91.


julia> @benchmark fib(50)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.970 μs … 42.054 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.961 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.722 μs ±  2.212 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █      ▂                                                    
  ▅██▇▃▃▃▃█▇▆▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▂ ▃
  11 μs           Histogram: frequency by time          26 μs <

 Memory estimate: 1.89 KiB, allocs estimate: 121.

I used ChatGPT, in effect; well, actually Poe.com, in case I would need to try some non-default AI LLM (I didn’t).

General-purpose assistant bot with strengths in programming-related tasks and non-English languages. Powered by gpt-3.5-turbo. Formerly known as Sage.

Can you change this to a TCO version:

function fib(n::Int64)::Int64
    n <= 2 && return 1
    n == 3 && return 2
    fib(n - 1) + fib(n - 2)
end

Assistant

Poe
Certainly! Here’s the tail-recursive version of the Fibonacci function:

function fib(n::Int64)::Int64
    function fib_helper(n::Int64, a::Int64, b::Int64)::Int64
        n <= 2 && return b
        fib_helper(n - 1, b, a + b)
    end
    
    n <= 2 && return 1
    n == 3 && return 2
    fib_helper(n - 1, 1, 2)
end

In this version, the fib_helper function takes two additional arguments, a and b, which represent the Fibonacci numbers fib(n-2) and fib(n-1) respectively. The fib_helper function is tail-recursive because the recursive call is the last operation performed, and it accumulates the intermediate results by updating a and b in each recursive call.

This tail-recursive implementation allows for better performance and avoids stack overflow errors for large values of n.

The only thing wrong about that answer is that TCO isn’t applied by Julia by default, and that doesn’t really matter much.

With:

julia> @time using TailRec
  0.000150 seconds (140 allocations: 13.141 KiB)

and sticking @tailrec in front of the function definition didn’t really change much timing-wise. It seems stack frames aren’t that slow. Just having a tail-recursive version of the algorithm matters; the Optimization, the O in TCO (with that macro), doesn’t matter so much, at least in this case.
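For reference, this is roughly how the macro would be applied (a sketch based on TailRec.jl’s documented usage; the standalone helper and the name fib_tr are mine, restructured from the AI version above):

using TailRec

# @tailrec rewrites the self tail call into a loop, so no new stack
# frames are created even though Julia does not apply TCO by default.
@tailrec function fib_helper(n, a, b)
    n <= 2 && return b
    fib_helper(n - 1, b, a + b)
end

fib_tr(n) = n <= 2 ? 1 : n == 3 ? 2 : fib_helper(n - 1, 1, 2)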

What happens in the CPU is that you write to and read from the stack, which is just memory, and writing isn’t slow, since you do not have to wait for it to be committed to memory. Reading from memory can be really slow, but again not here, since it comes from the cache. And you’re reading and (over)writing that relatively small part of memory, so only the last writes are actually committed to RAM. You might think a loop would actually help, but branch prediction seems to be the bottleneck.

Right, and that version is slower: two checks and branches vs. one. It claims “More than 35000x faster.”, though that was actually a small miscalculation; I calculated a little lower from their data… Not a huge issue, since it’s not a constant factor.

It’s actually 203163 times faster for fib(40), growing to 13431598 times faster for fib(50). I didn’t want to go higher, i.e. wait for fib(92), which is the highest before overflow… then fib(big"93").

I wanted to see if the AI would find that neat version you pointed to. Before benchmarking, it seemed that what the AI came up with is a better algorithm. But I benchmarked, and that version is 9-20 times faster than the AI version, most likely because it has no allocations. I’ll stare at the AI version and see where they come from. Possibly they can be eliminated, and then it will be faster yet… It can be, see next post.

From a compiler point of view, we don’t have to create new stack frames at all. It could do that, but it doesn’t, as the compiler knows it’s safe to stay in the same stack frame.

That’s the TCO idea, and it’s implied that C#, i.e. the C# compiler, supports it. It’s however claimed that C# does NOT support it, but in practice it does (the C# vs. CLR VM situation is strange):

Support

  • We’ve been using C in this post since GCC and Clang both support tail call optimization (TCO). For C++, the case holds, with Microsoft’s Visual C++ also offering support.
  • Java and Python do not support TCO, with the intention of preserving the stack trace for debugging.
  • Some internal Java classes also rely on the number of stack frames. Python’s BDFL, Guido van Rossum, has explicitly stated that no Python implementations should support TCO.
  • Kotlin even comes with a dedicated tailrec keyword which converts recursive functions to iterative ones, since the JVM (Kotlin compiles to JVM bytecode) doesn’t support TCO.
  • TCO is part of ECMAScript 6, i.e. the JavaScript specification; however, only Safari’s JavaScriptCore engine supports TCO. Chrome’s V8 retracted support for TCO.
  • C# does not support TCO; however, the VM it runs within, the Common Language Runtime (CLR), supports TCO.
  • Functional languages such as Haskell, F#, Scala and Elixir support TCO.

It appears that support for TCO is more of an ideological choice for language implementers than a technical one. It does manipulate the stack in ways the programmer would not expect and hence makes debugging harder. Refer to the documentation of the specific implementation of your favorite language to check if it supports tail call optimization.

Conclusion
I hope you understood the idea and techniques behind TCO. I guess the takeaway here is to prefer iterative solutions over recursive ones (that is almost always a good idea, performance-wise).

4 Likes