Why did the doubly recursive fibonacci microbenchmark become relatively slower?

In the original microbenchmarks, Julia was a bit behind C/Fortran (~1.3x) on fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2) (note this uses 64-bit integers in Julia and 32-bit integers in the others, but making fib type-generic didn’t make a difference for me). The most plausible explanation I could find was that non-inlined function calls have extra overhead for the GC, and it’s especially apparent in this benchmark where such calls are cheap and numerous by design. There are other microbenchmarks where Julia falls behind another language, but the function call overhead caught my eye.

The gap has widened to ~3.3x vs C and ~4.9x vs Fortran, so I’m starting to doubt that explanation (why would our GC need more overhead, and how is Fortran getting ahead?). I don’t understand how the differences in the assembly matter, so I’m asking for some insights. I don’t know where to find or how to reproduce these benchmarks at other points in time, so I don’t know when the gap started widening.

Here are the relevant numbers and the links I got them from.

The continuously updated Benchmarks · Julia Microbenchmarks page, C/Fortran (gcc 13.3.0), Julia 1.12.5, GitHub Actions ubuntu-latest x86_64 (I don’t know the specifics):

Language   fib(20) time (ms)   relative to C   relative to best (Fortran)
C          0.012957            1.0             1.465
Fortran    0.008844            0.6826          1.0
Julia      0.043365            3.3468          4.903

The old static Julia Micro-Benchmarks, C/Fortran (gcc 7.3.1), Julia v1.0.0, openSUSE LEAP 15.0 Linux, Intel® Core™ i7-3960X 3.30GHz CPU, 64GB 1600MHz DDR3 RAM:

Language   fib(20) time (ms)   relative to C   relative to best (Fortran)
C          0.022732            1.0             1.0118
Fortran    0.022466            0.9883          1.0
Julia      0.030162            1.3269          1.3426

This function is non-allocating, so I don’t think GC is relevant?

Do I understand correctly you’re comparing different versions of all compilers on different machines? If so, that’s not really a 1:1 comparison.

I don’t think that old fib benchmark has any validity, cf godbolt

To describe what you should be seeing when clicking that link: gcc 13.3 with -O3 spits out non-recursive assembly for the fib thing.
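To make "non-recursive" more concrete, here's a hypothetical Julia analogue of one transformation gcc applies (not what gcc literally emits): inlining one level of the recursion into itself, so each real call covers two levels and the total number of non-inlined calls is roughly halved. `fib2` is a made-up name for illustration.

```julia
# Sketch: one level of fib's recursion manually inlined.
# fib2(n) computes the same thing as fib(n-1) + fib(n-2) for n >= 2,
# but the two inner conditionals avoid a whole layer of call overhead.
fib2(n) = n < 2 ? n :
    (n - 1 < 2 ? n - 1 : fib2(n - 2) + fib2(n - 3)) +
    (n - 2 < 2 ? n - 2 : fib2(n - 3) + fib2(n - 4))

fib2(20)  # == 6765, same as the recursive version
```

gcc takes this much further (partially unrolling several levels into straight-line code), which is why its output barely resembles the source.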


I pulled out the fib benchmark, using both @btime from BenchmarkTools.jl and the older @timeit macro from the Julia Microbenchmarks.

If I use Clang (Apple clang version 17.0.0), which like Julia uses LLVM, then Julia (version 1.12.5, libLLVM-18.1.7) is still about 30% slower than C:

Julia, Julia Cint, and C (via @btime):
  12.750 μs (0 allocations: 0 bytes)
  12.791 μs (0 allocations: 0 bytes)
  9.875 μs (0 allocations: 0 bytes)

Julia, Julia Cint, and C (via @timeit, times in ms):
julia,Julia Int,0.013416
julia,Julia Cint,0.014500
julia,C,0.010834

(this is on an Apple M4).

Code
using Test, BenchmarkTools

fib(n) = n < oftype(n, 2) ? n : fib(n - oftype(n, 1)) + fib(n - oftype(n, 2))

@test fib(20) == 6765
@test fib(Cint(20))::Cint == 6765

import Libdl
const Clib = joinpath(pwd(), "libfib")  # full path, so dlopen finds it outside the default search path
Cfib = """
int fib(int n) {
    return n < 2 ? n : fib(n-1) + fib(n-2);
}
"""
open(`cc -fPIC -O3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, Cfib)
end

cfib(n) = @ccall Clib.fib(n::Cint)::Cint

@test cfib(20) == 6765

const NITER = 5
using Printf: @printf
macro timeit(ex, name)
    quote
        t = zeros(NITER)
        for i = 0:NITER
            e = 1000 * (@elapsed $(esc(ex)))
            if i > 0
                t[i] = e  # i == 0 is JIT warmup
            end
        end
        @printf "julia,%s,%f\n" $(esc(name)) minimum(t)
        GC.gc()
    end
end

println("Julia, Julia Cint, and C (via @btime):")
@btime fib(20)
@btime fib(Cint(20))
@btime cfib(20)

println("\nJulia, Julia Cint, and C (via @timeit, times in ms):")
@timeit fib(20) "Julia Int"
@timeit fib(Cint(20)) "Julia Cint"
@timeit cfib(Cint(20)) "C"

By “non-recursive” presumably it is using explicit stack management? That is “fair” — it is still using the same exponential-time algorithm, just via an optimized implementation.
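For reference, here's a minimal Julia sketch of what "explicit stack management" means here (`fib_stack` is a made-up name, not what gcc actually generates): the same exponential-time algorithm, but pending subproblems live on a manually managed stack instead of in call frames, so there are no function calls at all.

```julia
# Sketch: exponential fib with an explicit work stack instead of recursion.
# Each leaf of the recursion tree (m < 2) contributes m to the total,
# and those leaves sum to fib(n).
function fib_stack(n::Int)
    acc = 0
    stack = Int[n]
    while !isempty(stack)
        m = pop!(stack)
        if m < 2
            acc += m
        else
            push!(stack, m - 1)
            push!(stack, m - 2)
        end
    end
    return acc
end

fib_stack(20)  # == 6765
```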

On my machine it is about 1.7× faster than clang if I use gcc 15.2.0 to compile the C code:

Julia, Julia Cint, and C (via @btime):
  12.750 μs (0 allocations: 0 bytes)
  12.750 μs (0 allocations: 0 bytes)
  5.701 μs (0 allocations: 0 bytes)

Julia, Julia Cint, and C (via @timeit, times in ms):
julia,Julia Int,0.014292
julia,Julia Cint,0.015333
julia,C,0.006459

making Julia about 2.2× slower than gcc.

Can anyone replicate this using the code above?


the gc thing is that we added a gc safepoint to every function, which helps reduce time-to-safepoint, but makes function calls a little slower


What does that actually do at runtime, and how do I spot it in reflection? Granted, I never really learned what the LLVM IR and assembly mean, so I would miss it.

I’m not sure if there’s something wrong with the link I can’t spot, but there’s a call fib on line 158. I don’t know how that fits 2 recursive fib calls, but it’s there. Julia 1.12.5’s LLVM spits out 4 calls, no idea what that’s about either.

I’m not running the benchmark, but that sounds right. I think the languages are just benchmarked at their latest releases to be more relevant to contemporary users, though small non-inlined methods are not often relevant either. I’m more interested in learning what any compiler can do to what seems like a simple recursive function, as well as any Julia implementation quirks.

You have to do raw=true on the reflection tools (I don’t love that we lie a bit to the user)


oh? I thought function calls were “free” ? in fact, often when my AI tries to eliminate function calls in the name of microoptimization I admonish it telling it that’s not necessary in Julia :cry:

Tried that and now I see this extra argument and leading chunk in @code_llvm raw=true fib(Int32(20)), the type-generic version I mean. Worth noting that I don’t see this in v1.0.5, though I don’t know if that actually means there wasn’t a safepoint there.

 define swiftcc i32 @julia_fib_3229(ptr nonnull swiftself %pgcstack, i32 signext %"n::Int32") #0 !dbg !4 {
top:
  call void @llvm.dbg.value(metadata i32 %"n::Int32", metadata !13, metadata !DIExpression()), !dbg !14
  %ptls_field = getelementptr inbounds i8, ptr %pgcstack, i64 16
  %ptls_load = load ptr, ptr %ptls_field, align 8, !tbaa !15
  %0 = getelementptr inbounds i8, ptr %ptls_load, i64 16
  %safepoint = load ptr, ptr %0, align 8, !tbaa !19, !invariant.load !10
  fence syncscope("singlethread") seq_cst
  %1 = load volatile i64, ptr %safepoint, align 8, !dbg !14
  fence syncscope("singlethread") seq_cst

I also see that chunk minus the first line in something as simple as foo() = 1. That’s more work at runtime, right, not just “metadata”? And why is swift* showing up?

I’m guessing that quote was more about automatic inlining usually happening when it’d matter so the remaining manual opportunities almost always save negligible time for more code bloat, but yeah, this GC safepoint bit doesn’t look “zero”-cost.

Still not sure that explains the widening gap, though, the different compilers could be doing something else different.

Function calls are free when inlined (which Julia is pretty good at doing automatically). When not inlined, the safepoint is ~3 extra instructions (which only matters for incredibly tight recursive functions like fib)
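A small sketch of that distinction (`add1_inl`/`add1_noinl` are made-up names): when the callee is inlined, its body is pasted into the caller and there is no call, hence no per-call safepoint; `@noinline` preserves the real call and with it the safepoint load visible in the raw IR.

```julia
# Sketch: same trivial function, inlined vs. forced non-inlined.
@inline   add1_inl(x)   = x + 1
@noinline add1_noinl(x) = x + 1

f_inl(x)   = add1_inl(x)    # body inlined into f_inl; no call emitted
f_noinl(x) = add1_noinl(x)  # emits a real call (plus its safepoint)

f_inl(1), f_noinl(1)  # both == 2; only the generated code differs

# Compare the generated IR yourself:
#   @code_llvm raw=true f_inl(1)
#   @code_llvm raw=true f_noinl(1)
```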


I’m not convinced yet that there is a widening gap? I can’t reproduce it on my machine if I use similar compilers (Clang cc vs. Julia), as I noted above.

Again, the numbers in the microbenchmark widened, and if it’s down to different compilers doing different optimizations, I’d like to learn about that too, and hopefully how to figure it out myself in the future.

Again, if you have different versions of everything, talking about “widening” is a bit moot, since everything in the initial and final points you look at is different. It’d be more interesting to have fixed versions of the system and the C/Fortran compilers, and compare different Julia versions. That would show an actual change in Julia’s performance.

I see a 2.75x difference on Zen 4 (so x86-64) with Julia v1.12. It’s 2.25x when I use Julia v1.0 on the same machine, with the same version of GCC (13.3).

See the documentation of GC.safepoint:

  Inserts a point in the program where garbage collection may run.

  Safepoints are fast and do not themselves trigger garbage collection.
  However, if another thread has requested the GC to run, reaching a
  safepoint will cause the current thread to block and wait for the GC.

  This can be useful in rare cases in multi-threaded programs where some
  tasks are allocating memory (and hence may need to run GC) but other tasks
  are doing only simple operations (no allocation, task switches, or I/O),
  which do not yield control to Julia's runtime, and therefore blocks the GC
  from running. Calling this function periodically in the non-allocating
  tasks allows garbage collection to run.

  Note that even though safepoints are fast (typically around 2 clock
  cycles), they can still degrade performance if called in a tight loop.

IMO inserting safepoints on function calls is a good idea - or at least I see the appeal. The logic is that:

  1. There need to be safepoints intermittently in running Julia code; else, multithreaded programs will too often get in a situation where threads wait around for each other to hit a safepoint
  2. It’s way, way too annoying to have to put safepoints into all functions manually, so they should be put into place automatically
  3. Julia (via LLVM) tends to inline quite aggressively, which will delete the safepoint, so in practice, safepoints are only called in ‘big’ functions (or recursive functions, but these are essentially never both small and performance sensitive).

Therefore, putting safepoints in non-inlined functions will make it so that all ordinary Julia code will reach safepoints regularly, with only a very small overhead relative to the big functions where they are inserted.

Users only need to think about inserting GC.safepoint when programs spend significant time in tight, inlined loops - but having to think about other threads when writing tight, inlined loops is inevitable anyway.

I think it’s quite elegant. It’s not perfect - it would be nice if Julia’s compiler statically knew which functions could run concurrently with the GC, and it would be nice if there was some kind of tooling to statically check for tight loops without GC safepoints.
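A sketch of the documented manual-safepoint pattern (`spin_sum` is a made-up example): a long-running, non-allocating loop periodically calls GC.safepoint() so that, if another thread has requested a collection, this thread doesn't block it from running.

```julia
# Sketch: non-allocating hot loop that cooperates with the GC.
# Without the GC.safepoint() call, this loop would never reach a
# safepoint and could stall a collection requested by another thread.
function spin_sum(v::Vector{Int})
    s = 0
    for i in eachindex(v)
        s += v[i]
        # Cheap when no GC is pending; yields to the GC when one is.
        i % 10_000 == 0 && GC.safepoint()
    end
    return s
end

spin_sum(collect(1:100_000))  # == 5000050000
```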


Wait, is this overhead still paid when disabling GC? Because that would be concerning for realtime Julia.

Furthermore, is the cost paid when excising the runtime entirely with juliac?

Why is multi-threaded assumed the default? In my case, Julia runs single-threaded. (There are other non-Julia threads, but obviously those won’t call Julia’s GC.)

Because Julia can run multithreaded, codegen has to assume it will.

Because that would be concerning for realtime Julia.

It’s hard to see this being a problem for realtime julia. It’s only an extra ~3 (very cheap) instructions per function call and embedded code generally has few function calls.
