Why did the doubly recursive fibonacci microbenchmark become relatively slower?

In particular, this seems to be because gcc -O3 does recursive inlining (a form of recursion coarsening that reduces function-call overhead) by default. If you turn this optimization off, the difference goes away:

  • If you pass --param max-inline-recursive-depth-auto=0 to gcc, then the 2x advantage over LLVM/Clang disappears on my machine (and, in fact, gcc becomes 8% slower than Clang).
  • There was work to implement a similar optimization for LLVM, but apparently it was abandoned?

This also means that if you coarsen your recursion manually (e.g. by enlarging the base case), which you should probably do anyway for recursive function calls that do little work, then this advantage of gcc will no longer matter.
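
To be concrete, here is a sketch of the kind of manual coarsening I mean, in Julia terms (the cutoff of 16 is arbitrary, just large enough that each remaining call does real work):

# Plain doubly recursive version: every call does almost no work.
fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)

# Coarsened version: small n is handled iteratively, so each remaining
# recursive call amortizes the call overhead over more arithmetic.
function fib_coarse(n)
    if n < 16                     # enlarged base case; cutoff is arbitrary
        a, b = 0, 1
        for _ in 1:n
            a, b = b, a + b
        end
        return a
    end
    return fib_coarse(n - 1) + fib_coarse(n - 2)
end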

C++ and Rust can also run multi-threaded and don’t pay this cost. There should be a toggle, in the vein of Debug/Release builds. In dev/Debug mode, it may run with all protections, sanitizers, borrow checkers, etc. In Release mode, it should have true 0-cost abstractions.

It’s perfectly fine for juliac to be this toggle, which is why I asked whether it works today.

A few things:

  • First, it’s not about the specific cost; it’s about having one in the first place. I’d rather my apps be doing actual number-crunching than book-keeping.
  • Second, it’s the kind of thing that can sneak up on you and tank your performance. App developers don’t get control over inlining (other than hinting with @inline or LLVM intrinsics?), so they have to inspect the output of the compiler every time.
  • Third, yes, I agree that ideally embedded leaf code should inline, but it doesn’t always. (Embedded in this context means running on a standard desktop machine, by the way.)
  • Fourth, we’ve already seen a microbenchmark being impacted; it’s not hard to imagine a similar real workload with non-recursive function calls.
  • Finally, I have to stress again that the runtime isn’t needed in this context, so the cost is paid “for nothing”.

C++ and Rust can also run multi-threaded and don’t pay this cost.

They very much do. C, Rust, and Julia compilers all have optimizations that would be valid if they knew the code would only ever run single-threaded, but they don’t apply them because the compiler doesn’t know whether there will be multiple threads at runtime. This isn’t a sanitizer or a debug assertion; it’s necessary for good performance of real code.

If you have a tiny performance-critical function (e.g. it does 1 comparison and 1 addition like fib) that is not being inlined for some reason, then your performance is already hurting. The additional 3 instructions for the safepoint are the least of your problems.

If the function is not tiny, then the safepoint doesn’t matter.

It makes sense not to automatically insert safepoints if there isn’t a precise GC to begin with, but there’s no official Julia implementation that lacks a GC. None of the articles and talks about juliac announced compiling without the GC, and binary trimming was never described as removing instructions or arguments from compiled methods. The safepoints on their own shouldn’t be a problem for realtime programs because they don’t allocate or force a GC cycle; I posted the LLVM IR earlier, and it doesn’t look nondeterministic.

I’m not sure about this; I suspect a bigger effect from the rest of the compiler. @code_llvm on 1.0.5 looks very different and more straightforward, beyond just the lack of a visible safepoint. (GC.safepoint was introduced in 1.4, so is that also when Julia started adding safepoints to functions?) The difference is also only 3.3% (33.6μs to 34.7μs) on my system (Windows 11, x86-64).

I tried to benchmark the effect of adding a safepoint more directly, but either it’s not working somehow or a safepoint in a loop is a time-traveling optimization:

julia> @noinline add(x,y) = x+y
add (generic function with 1 method)

julia> @noinline adds(x,y) = (GC.safepoint(); x+y)
adds (generic function with 1 method)

julia> @btime @noinline(add($1, $2)) evals=10000
  3.080 ns (0 allocations: 0 bytes)
3

julia> @btime @noinline(adds($1, $2)) evals=10000
  2.570 ns (0 allocations: 0 bytes)
3

I think the problem with your benchmark is that the optimizer sees 2 safepoints in a row and is smart enough to realize that they do the same thing and thus the second one can be deleted.

There’s a bit of redundancy elimination, but @code_llvm raw=true and @code_native raw=true still show repeated lines in adds relative to add. No idea how 3 extra mov instructions could save time; maybe some unintuitive pipelining?

I’m very far from an expert, but I’m pretty sure LLVM cannot delete a safepoint load. It’s implemented as a volatile load and separated from the function by a cst memory barrier both before and after, so the load cannot be deleted, or moved, or coalesced by the compiler. Intentionally so, because it would be very bad to ‘optimize away’ a ‘dead’ safepoint load.
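
With the adds from above you can check this yourself; I’m not pasting the full IR, just pointing at what to look for:

julia> @code_llvm raw=true debuginfo=:none adds(1, 2)
# the GC.safepoint call shows up as a `load volatile` from the runtime's
# safepoint page, which is exactly why LLVM won't hoist or delete it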

Ok, but C doesn’t pay the “GC book-keeping for multi-threaded runtime” cost.

That’s true for 1 function, but the real workload is closer to “function calls function calls function…”.

I don’t think it’s possible to “patch” compiled code to remove safepoints, at least not easily. Instead, there should be a compilation flag gc_safepoints=false.

They’re not a problem, in the sense that “technically, it still runs”.

They’re a “problem” because instead of having 10 ms for real work in your process() function, you now have only 8 ms, with 2 ms going to book-keeping.

(I’m taking the 20% number from the microbenchmark, but obviously what matters is the real workload. It could be higher or lower. Without a way to compile real work with gc_safepoints=false, there isn’t even a way to quantify the slowdown!)

You raise a good point that this was added before juliac was a thing. The performance regression was less relevant back then.

It’s true for any number of nested functions as long as the work per function call is not tiny. For example, if it’s recursive, then the base case needs to be large enough to amortize the function-call overhead.

Otherwise you are sunk without inlining, safepoint or no safepoint.

The compiler could still emit two versions of assembly code, one for single-threaded runs and the other for multi-threaded runs, if I understand the discussion correctly. But I guess Julia chooses not to hyper-optimize corner cases like this.

One problem with the compiler toggle is that we’d have to duplicate precompile workloads, one for the single-threaded no-safepoint version and one for the n-threaded safepoint version, and there’s already community concern about precompilation times increasing.

Something we haven’t mentioned is that garbage collection is possible without safepoints. However, without certainty about GC roots (and I still don’t know how this is synchronized across threads), such GCs are conservative (like the Boehm-Demers-Weiser GC for C, C++, and C# on Mono), which delays identification of garbage and makes some approaches infeasible (copying) or limited (generational, which Julia uses). It might not be a good tradeoff to eliminate safepoints but make the GC slower or limit our options in a potential GC-swapping future. The Boehm GC does have an incremental mode, though, which is relevant to allocating realtime contexts like the Unity game engine.

Another interesting thing I’m starting to notice is that Rust GC crates also distinguish thread-local and thread-safe implementations, e.g. dumpster mentioning synchronization overheads. That’s also about sharing memory among threads, though, so I’m not sure if there’s any parallel to Julia (task_local_storage sounds like it’s about tasks, which can migrate threads).

The concurrent-first language Go has a hybrid system. Besides the familiar inserted synchronous safepoints and precise GC, an asynchronous safepoint can be triggered by the runtime to preempt tight loops after a 10ms timer runs out. Apparently it doesn’t insert the instruction overhead into the loop, but it requires conservative scanning of the frame and is complicated enough to have remaining caveats (Issue #36365 · golang/go).

Case in point, I found GC frames in sincos, atan, sin, cos, exp and expm1. Removing them gave a small but important win.

Are you referring to safepoints here or something else?

How?

Yes.

ccall to compile-time const Base.Math.libm to prevent dynamic lookup.

Can you give a working benchmark, ideally of all of the mentioned functions? That doesn’t sound like it’d remove the automatic safepoints in method calls. I’d sooner suspect the performance differences stem from openlibm versus the native Julia ports, but it’s hard to tell without knowing exactly what you’re doing. Just to be clear, I can’t reproduce the effect from your descriptions:

julia> x = 1.5; @btime sin($x) evals=100000
  6.167 ns (0 allocations: 0 bytes)
0.9974949866040544

julia> x = 1.5; @btime @noinline(sin($x)) evals=100000 # I think noinline failed
  6.167 ns (0 allocations: 0 bytes)
0.9974949866040544

julia> @noinline mysin(x) = sin(x) # try to stop inlining, would add a safepoint
mysin (generic function with 1 method)

julia> x = 1.5; @btime mysin($x) evals=100000
  7.253 ns (0 allocations: 0 bytes)
0.9974949866040544

julia> Base.Math.libm
"libopenlibm"

julia> Base.libm_name
"libopenlibm"

julia> x = 1.5; @btime @ccall(Base.libm_name.sin($x::Cdouble)::Cdouble) evals=100000
  7.362 ns (0 allocations: 0 bytes)
0.9974949866040544

julia> @noinline mylibmsin(x::Cdouble) = @ccall Base.libm_name.sin(x::Cdouble)::Cdouble
mylibmsin (generic function with 1 method)

julia> x = 1.5; @btime mylibmsin($x) evals=100000
  8.388 ns (0 allocations: 0 bytes)
0.9974949866040544

You can get a lot more than a small win if you implement your own versions of these functions. I had an embedded C project where the sin function call overhead was too high, so we used a 256-entry lookup table for audio synthesis of a sine wave.
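
Roughly, the same idea in Julia terms (just a sketch; the real thing was C, and the table size and names here are only illustrative):

# 256-entry sine table covering one full period; nearest-entry lookup with
# wraparound trades accuracy for a single array read.
const SINE_TABLE = [sin(2π * i / 256) for i in 0:255]

# phase is in "turns": 1.0 == one full period
table_sin(phase) = @inbounds SINE_TABLE[mod(round(Int, phase * 256), 256) + 1]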

These are the timings I’m getting with your example:

julia> x = 1.5; @btime @ccall(Base.libm_name.sin($x::Cdouble)::Cdouble) evals=100000
  8.377 ns (0 allocations: 0 bytes)
0.9974949866040544

julia> const MY_LIBM = Base.Math.libm
"libopenlibm"

julia> x = 1.5; @btime @ccall(MY_LIBM.sin($x::Cdouble)::Cdouble) evals=100000
  7.841 ns (0 allocations: 0 bytes)
0.9974949866040544

Yeah, I’m doing a bit of that too for other functions. I didn’t tabulate sin.

256 points wasn’t enough for my precision needs.

:sob:

A custom polynomial approximation could be superior to a table lookup.
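
For example, something along these lines; this is just a truncated Taylor series to show the shape of it, and a real implementation would use minimax coefficients and proper range reduction:

# sin(x) ≈ x - x^3/6 + x^5/120 - x^7/5040; error is on the order of 1e-7
# at the edge of |x| ≤ π/4. evalpoly applies Horner's rule, so this is a
# multiply plus a short chain of fused multiply-adds.
poly_sin(x) = x * evalpoly(x * x, (1.0, -1/6, 1/120, -1/5040))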

Oh wow, that does make a difference (@code_llvm is much longer for mylibmsin than for mylibmsin2, though mylibmsin2 and mylibmsin3 look like the same call and return), though still not enough to beat the native sin port on my system. That threw me off because the compiler typically inlines a constant variable’s value no matter how many module qualifications we go through (compare @code_llvm for libm1() = MY_LIBM and libm2() = Base.Math.libm). The @ccall macro reasonably doesn’t eval it, a ccall alone doesn’t compile, and somehow compiling a surrounding function doesn’t help inline the value into a ccall.

julia> Base.libm_name === Base.Math.libm === MY_LIBM === "libopenlibm"
true

julia> isconst.([Base Base.Math Main], [:libm_name :libm :MY_LIBM])
1×3 BitMatrix:
 1  1  1

julia> @noinline mylibmsin2(x::Cdouble) = @ccall MY_LIBM.sin(x::Cdouble)::Cdouble
mylibmsin2 (generic function with 1 method)

julia> @noinline mylibmsin3(x::Cdouble) = @ccall "libopenlibm".sin(x::Cdouble)::Cdouble
mylibmsin3 (generic function with 1 method)

julia> begin
       x = 1.5
       @btime sin($x) evals=100000
       @btime mysin($x) evals=100000
       @btime @ccall(Base.libm_name.sin($x::Cdouble)::Cdouble) evals=100000
       @btime mylibmsin($x) evals=100000
       @btime @ccall(MY_LIBM.sin($x::Cdouble)::Cdouble) evals=100000
       @btime mylibmsin2($x) evals=100000
       @btime @ccall "libopenlibm".sin($x::Cdouble)::Cdouble
       @btime mylibmsin3($x) evals=100000
       end;
  6.167 ns (0 allocations: 0 bytes)
  7.252 ns (0 allocations: 0 bytes)
  7.277 ns (0 allocations: 0 bytes)
  8.602 ns (0 allocations: 0 bytes)
  6.740 ns (0 allocations: 0 bytes)
  7.857 ns (0 allocations: 0 bytes)
  6.700 ns (0 allocations: 0 bytes)
  8.280 ns (0 allocations: 0 bytes)

Right now, I do think that calling C code itself doesn’t contribute an automatic safepoint the way non-inlined Julia method calls do, but the overall ccall does benefit from inlining and eliding safepoints because of ccall’s automatic cconvert and unsafe_convert calls. If you’re seeing longer runtimes with the plain sin(x) benchmark, which is also inlining, then I’d chalk that difference up to system dependence of the Julia vs C compilers, which also looks like the main reason behind the original question.
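
If you want to see where those conversions come from, you can expand the macro yourself:

julia> @macroexpand @ccall "libopenlibm".sin(1.5::Cdouble)::Cdouble
# the expansion wraps the argument in Base.cconvert / Base.unsafe_convert
# before the underlying ccall; that's the part that benefits from inlining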