Why did the doubly recursive fibonacci microbenchmark become relatively slower?

In the original microbenchmarks, Julia was a bit behind C/Fortran (~1.3x) on fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2) (note this uses 64-bit integers in Julia and 32-bit integers in the others, but making fib type-generic didn’t make a difference for me). The most plausible explanation I could find was that non-inlined function calls have extra overhead for the GC, and it’s especially apparent in this benchmark where such calls are cheap and numerous by design. There are other microbenchmarks where Julia falls behind another language, but the function call overhead caught my eye.

The gap has widened to ~3.3x vs C and ~4.9x vs Fortran, so I’m starting to doubt that explanation (why would our GC need more overhead, and how is Fortran getting ahead?). I don’t understand how the differences in the assembly matter, so I’m asking for some insights. I don’t know where to find or how to reproduce these benchmarks at other points in time, so I don’t know when the gap started widening.

Here are the relevant numbers and the links I got them from.

The continuously updated Benchmarks · Julia Microbenchmarks page, C/Fortran (gcc 13.3.0), Julia 1.12.5, GitHub Actions ubuntu-latest x86_64 (I don’t know the specifics):

Language   fib(20) time (ms)   relative to C   relative to best (Fortran)
C          0.012957            1.0             1.465
Fortran    0.008844            0.6826          1.0
Julia      0.043365            3.3468          4.903

The old static Julia Micro-Benchmarks, C/Fortran (gcc 7.3.1), Julia v1.0.0, openSUSE LEAP 15.0 Linux, Intel® Core™ i7-3960X 3.30GHz CPU, 64GB 1600MHz DDR3 RAM:

Language   fib(20) time (ms)   relative to C   relative to best (Fortran)
C          0.022732            1.0             1.0118
Fortran    0.022466            0.9883          1.0
Julia      0.030162            1.3269          1.3426

This function is non-allocating, so I don’t think GC is relevant?

Do I understand correctly you’re comparing different versions of all compilers on different machines? If so, that’s not really a 1:1 comparison.

I don’t think that old fib benchmark has any validity, cf godbolt

To describe what you should be seeing when clicking that link: gcc 13.3 with -O3 spits out non-recursive assembly for the fib thing.
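To make "non-recursive" more concrete, here's a hypothetical Julia analogue of one transformation gcc applies (not what gcc literally emits): inlining one level of the recursion into itself, so each real call covers two levels and the total number of non-inlined calls is roughly halved. `fib2` is a made-up name for illustration.

```julia
# Sketch: one level of fib's recursion manually inlined.
# fib2(n) computes the same thing as fib(n-1) + fib(n-2) for n >= 2,
# but the two inner conditionals avoid a whole layer of call overhead.
fib2(n) = n < 2 ? n :
    (n - 1 < 2 ? n - 1 : fib2(n - 2) + fib2(n - 3)) +
    (n - 2 < 2 ? n - 2 : fib2(n - 3) + fib2(n - 4))

fib2(20)  # == 6765, same as the recursive version
```

gcc takes this much further (partially unrolling several levels into straight-line code), which is why its output barely resembles the source.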


I pulled out the fib benchmark, using both @btime from BenchmarkTools.jl and the older @timeit macro from the Julia Microbenchmarks.

If I use Clang (Apple clang version 17.0.0), which like Julia uses LLVM, then Julia (version 1.12.5, libLLVM-18.1.7) is still about 30% slower than C:

Julia, Julia Cint, and C (via @btime):
  12.750 μs (0 allocations: 0 bytes)
  12.791 μs (0 allocations: 0 bytes)
  9.875 μs (0 allocations: 0 bytes)

Julia, Julia Cint, and C (via @timeit, times in ms):
julia,Julia Int,0.013416
julia,Julia Cint,0.014500
julia,C,0.010834

(this is on an Apple M4).

Code
using Test, BenchmarkTools

fib(n) = n < oftype(n, 2) ? n : fib(n - oftype(n, 1)) + fib(n - oftype(n, 2))

@test fib(20) == 6765
@test fib(Cint(20))::Cint == 6765

import Libdl
const Clib = joinpath(pwd(), "libfib")  # full path, so dlopen finds it outside the default search path
Cfib = """
int fib(int n) {
    return n < 2 ? n : fib(n-1) + fib(n-2);
}
"""
open(`cc -fPIC -O3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, Cfib)
end

cfib(n) = @ccall Clib.fib(n::Cint)::Cint

@test cfib(20) == 6765

const NITER = 5
using Printf: @printf
macro timeit(ex, name)
    quote
        t = zeros(NITER)
        for i = 0:NITER
            e = 1000 * (@elapsed $(esc(ex)))
            if i > 0
                t[i] = e  # i == 0 is JIT warmup
            end
        end
        @printf "julia,%s,%f\n" $(esc(name)) minimum(t)
        GC.gc()
    end
end

println("Julia, Julia Cint, and C (via @btime):")
@btime fib(20)
@btime fib(Cint(20))
@btime cfib(20)

println("\nJulia, Julia Cint, and C (via @timeit, times in ms):")
@timeit fib(20) "Julia Int"
@timeit fib(Cint(20)) "Julia Cint"
@timeit cfib(Cint(20)) "C"

By “non-recursive” presumably it is using explicit stack management? That is “fair” — it is still using the same exponential-time algorithm, just via an optimized implementation.
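For reference, here's a minimal Julia sketch of what "explicit stack management" means here (`fib_stack` is a made-up name, not what gcc actually generates): the same exponential-time algorithm, but pending subproblems live on a manually managed stack instead of in call frames, so there are no function calls at all.

```julia
# Sketch: exponential fib with an explicit work stack instead of recursion.
# Each leaf of the recursion tree (m < 2) contributes m to the total,
# and those leaves sum to fib(n).
function fib_stack(n::Int)
    acc = 0
    stack = Int[n]
    while !isempty(stack)
        m = pop!(stack)
        if m < 2
            acc += m
        else
            push!(stack, m - 1)
            push!(stack, m - 2)
        end
    end
    return acc
end

fib_stack(20)  # == 6765
```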

On my machine it is about 1.7× faster than clang if I use gcc 15.2.0 to compile the C code:

Julia, Julia Cint, and C (via @btime):
  12.750 μs (0 allocations: 0 bytes)
  12.750 μs (0 allocations: 0 bytes)
  5.701 μs (0 allocations: 0 bytes)

Julia, Julia Cint, and C (via @timeit, times in ms):
julia,Julia Int,0.014292
julia,Julia Cint,0.015333
julia,C,0.006459

making Julia about 2.2× slower than gcc.

Can anyone replicate this using the code above?


the gc thing is that we added a gc safepoint to every function, which helps reduce time-to-safepoint, but makes function calls a little slower


What does that actually do at runtime, and how do I spot it in reflection? Granted, I never really learned what the LLVM IR and assembly mean, so I would miss it.

I’m not sure if there’s something wrong with the link I can’t spot, but there’s a call fib on line 158. I don’t know how that fits 2 recursive fib calls, but it’s there. Julia 1.12.5’s LLVM spits out 4 calls, no idea what that’s about either.

I’m not running the benchmark, but that sounds right. I think the languages are just benchmarked at their latest releases to be more relevant to contemporary users, though small non-inlined methods are not often relevant either. I’m more interested in learning what any compiler can do to what seems like a simple recursive function, as well as any Julia implementation quirks.

You have to do raw=true on the reflection tools (I don’t love that we lie a bit to the user)


oh? I thought function calls were “free” ? in fact, often when my AI tries to eliminate function calls in the name of microoptimization I admonish it telling it that’s not necessary in Julia :cry:

Tried that and now I see this extra argument and leading chunk in @code_llvm raw=true fib(Int32(20)), the type-generic version I mean. Worth noting that I don’t see this in v1.0.5, though I don’t know if that actually means there wasn’t a safepoint there.

 define swiftcc i32 @julia_fib_3229(ptr nonnull swiftself %pgcstack, i32 signext %"n::Int32") #0 !dbg !4 {
top:
  call void @llvm.dbg.value(metadata i32 %"n::Int32", metadata !13, metadata !DIExpression()), !dbg !14
  %ptls_field = getelementptr inbounds i8, ptr %pgcstack, i64 16
  %ptls_load = load ptr, ptr %ptls_field, align 8, !tbaa !15
  %0 = getelementptr inbounds i8, ptr %ptls_load, i64 16
  %safepoint = load ptr, ptr %0, align 8, !tbaa !19, !invariant.load !10
  fence syncscope("singlethread") seq_cst
  %1 = load volatile i64, ptr %safepoint, align 8, !dbg !14
  fence syncscope("singlethread") seq_cst

I also see that chunk minus the first line in something as simple as foo() = 1. That’s more work at runtime, right, not just “metadata”? And why is swift* showing up?

I’m guessing that quote was more about automatic inlining usually happening when it’d matter so the remaining manual opportunities almost always save negligible time for more code bloat, but yeah, this GC safepoint bit doesn’t look “zero”-cost.

Still not sure that explains the widening gap, though, the different compilers could be doing something else different.

Function calls are free when inlined (which Julia is pretty good at doing automatically). When not inlined, the safepoint is ~3 extra instructions (which only matters for incredibly tight recursive functions like fib)
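A small sketch of that distinction (`add1_inl`/`add1_noinl` are made-up names): when the callee is inlined, its body is pasted into the caller and there is no call, hence no per-call safepoint; `@noinline` preserves the real call and with it the safepoint load visible in the raw IR.

```julia
# Sketch: same trivial function, inlined vs. forced non-inlined.
@inline   add1_inl(x)   = x + 1
@noinline add1_noinl(x) = x + 1

f_inl(x)   = add1_inl(x)    # body inlined into f_inl; no call emitted
f_noinl(x) = add1_noinl(x)  # emits a real call (plus its safepoint)

f_inl(1), f_noinl(1)  # both == 2; only the generated code differs

# Compare the generated IR yourself:
#   @code_llvm raw=true f_inl(1)
#   @code_llvm raw=true f_noinl(1)
```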


I’m not convinced yet that there is a widening gap? I can’t reproduce it on my machine if I use similar compilers (Clang cc vs. Julia), as I noted above.

Again, the numbers in the microbenchmark widened, and if it’s down to different compilers doing different optimizations, I’d like to learn about that too, and hopefully how to figure it out myself in the future.

Again, if you have different versions of everything, talking about “widening” is a bit moot, since everything in the initial and final points you look at is different. It’d be more interesting to have fixed versions of the system and the C/Fortran compilers, and compare different Julia versions. That would show an actual change in Julia’s performance.

I see a 2.75x difference on Zen 4 (so x86-64) with Julia v1.12. It’s 2.25x when I use Julia v1.0 on the same machine, with the same version of GCC (13.3).

See the documentation of GC.safepoint:

  Inserts a point in the program where garbage collection may run.

  Safepoints are fast and do not themselves trigger garbage collection.
  However, if another thread has requested the GC to run, reaching a
  safepoint will cause the current thread to block and wait for the GC.

  This can be useful in rare cases in multi-threaded programs where some
  tasks are allocating memory (and hence may need to run GC) but other tasks
  are doing only simple operations (no allocation, task switches, or I/O),
  which do not yield control to Julia's runtime, and therefore blocks the GC
  from running. Calling this function periodically in the non-allocating
  tasks allows garbage collection to run.

  Note that even though safepoints are fast (typically around 2 clock
  cycles), they can still degrade performance if called in a tight loop.

IMO inserting safepoints on function calls is a good idea - or at least I see the appeal. The logic is that:

  1. There need to be safepoints intermittently in running Julia code; else, multithreaded programs will too often get in a situation where threads wait around for each other to hit a safepoint
  2. It’s way, way too annoying to have to put safepoints into all functions manually, so they should be put into place automatically
  3. Julia (via LLVM) tends to inline quite aggressively, which will delete the safepoint, so in practice, safepoints are only called in ‘big’ functions (or recursive functions, but these are essentially never both small and performance sensitive).

Therefore, putting safepoints in non-inlined functions will make it so that all ordinary Julia code will reach safepoints regularly, with only a very small overhead relative to the big functions where they are inserted.

Users only need to think about inserting GC.safepoint when programs spend significant time in tight, inlined loops - but having to think about other threads when writing tight, inlined loops is inevitable anyway.

I think it’s quite elegant. It’s not perfect - it would be nice if Julia’s compiler statically knew which functions could run concurrently with the GC, and it would be nice if there was some kind of tooling to statically check for tight loops without GC safepoints.
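A sketch of the documented manual-safepoint pattern (`spin_sum` is a made-up example): a long-running, non-allocating loop periodically calls GC.safepoint() so that, if another thread has requested a collection, this thread doesn't block it from running.

```julia
# Sketch: non-allocating hot loop that cooperates with the GC.
# Without the GC.safepoint() call, this loop would never reach a
# safepoint and could stall a collection requested by another thread.
function spin_sum(v::Vector{Int})
    s = 0
    for i in eachindex(v)
        s += v[i]
        # Cheap when no GC is pending; yields to the GC when one is.
        i % 10_000 == 0 && GC.safepoint()
    end
    return s
end

spin_sum(collect(1:100_000))  # == 5000050000
```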


Wait, is this overhead still paid when disabling GC? Because that would be concerning for realtime Julia.

Furthermore, is the cost paid when excising the runtime entirely with juliac?

Why is multi-threaded assumed the default? In my case, Julia runs single-threaded. (There are other non-Julia threads, but obviously those won’t call Julia’s GC.)

Because Julia can run multithreaded, codegen has to assume it will.

Because that would be concerning for realtime Julia.

It’s hard to see this being a problem for realtime julia. It’s only an extra ~3 (very cheap) instructions per function call and embedded code generally has few function calls.
