In particular, this seems to be because gcc -O3 performs recursive inlining by default (a form of recursion coarsening, which reduces the function-call overhead). If you turn this optimization off, the difference goes away:
If you pass --param max-inline-recursive-depth-auto=0 to gcc, then the 2x advantage over LLVM/Clang disappears on my machine (and, in fact, gcc becomes 8% slower than Clang).
This also means that if you coarsen your recursion manually (e.g. by enlarging the base case), which you should probably do anyway for recursive function calls that do little work, then this advantage of gcc will no longer matter.
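To make "enlarging the base case" concrete, here is a minimal C sketch of manual recursion coarsening for fib. The function names and the threshold of 8 are just illustrative choices, not taken from the benchmark above; the point is that each recursive call now does an iterative chunk of work, amortizing the call overhead.

```c
#include <stdint.h>

/* Naive recursive Fibonacci: one function call per node of the call tree. */
static uint64_t fib(unsigned n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

/* Manually coarsened version: the enlarged base case handles n < 8
 * iteratively, so the per-call overhead is amortized over more real work. */
static uint64_t fib_coarse(unsigned n) {
    if (n < 8) {
        uint64_t a = 0, b = 1;           /* iterative base case */
        for (unsigned i = 0; i < n; i++) {
            uint64_t t = a + b;
            a = b;
            b = t;
        }
        return a;
    }
    return fib_coarse(n - 1) + fib_coarse(n - 2);
}
```

This is roughly what gcc's recursive inlining does for you automatically; doing it by hand makes the result independent of any one compiler's inlining heuristics.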
C++ and Rust can also run multi-threaded and don’t pay this cost. There should be a toggle, in the vein of Debug/Release builds. In dev/Debug mode, it would run with all protections, sanitizers, borrow checkers, etc. In Release mode, it should have truly zero-cost abstractions.
It’s perfectly fine for juliac to be this toggle, which is why I asked whether it works today.
First, it’s not about the specific cost but about having one in the first place: I’d rather my apps be doing actual number-crunching than book-keeping. Second, it’s the kind of thing that can sneak up on you and tank your performance; app developers don’t get control over inlining (other than hinting with @inline, or LLVM intrinsics?), so they have to inspect the compiler’s output every time. Third, yes, I agree that ideally embedded leaf code should inline, but it doesn’t always. ("Embedded" in this context means running on a standard desktop machine, by the way.) Fourth, we’ve already seen a microbenchmark being impacted; it’s not hard to imagine a similar real workload with non-recursive function calls. Finally, I have to stress again that the runtime isn’t needed in this context, so the cost is paid “for nothing”.
C++ and Rust can also run multi-threaded and don’t pay this cost.
They very much do. C++ and Rust (and Julia) compilers all have optimizations that would be valid if the code were guaranteed to run single-threaded, but they don’t apply them, because the compiler doesn’t know whether there will be multiple threads at runtime. This isn’t a sanitizer or a debug assertion; it’s necessary for good performance of real code.
If you have a tiny performance-critical function (e.g. it does 1 comparison and 1 addition like fib) that is not being inlined for some reason, then your performance is already hurting. The additional 3 instructions for the safepoint are the least of your problems.
If the function is not tiny, then the safepoint doesn’t matter.
It makes sense not to automatically insert safepoints if there isn’t a precise GC to begin with, but there’s no official Julia implementation that lacks a GC. None of the articles and talks about JuliaC announced compiling without the GC, and binary trimming was never described to remove instructions or arguments from compiled methods. The safepoints on their own shouldn’t be a problem for realtime programs because they don’t allocate or force a GC cycle; I posted the LLVM IR earlier, and it doesn’t look nondeterministic.
I’m not sure about this, I’m suspecting a bigger effect from the rest of the compiler. @code_llvm on 1.0.5 looks very different and more straightforward besides the lack of a visible safepoint. (GC.safepoint was introduced in 1.4, so is that also when Julia started adding safepoints to functions?) The difference is also only 3.3% (33.6μs to 34.7μs) on my system (Windows 11, x86-64).
I tried to benchmark the effect of adding a safepoint more directly, but it’s either not working somehow or looping safepoint is a time-traveling optimization:
I think the problem with your benchmark is that the optimizer sees 2 safepoints in a row and is smart enough to realize that they do the same thing and thus the second one can be deleted.
There’s a bit of redundancy elimination, but @code_llvm raw=true and @code_native raw=true still show repeated lines in adds relative to add. No idea how three extra movs could save time; maybe some unintuitive pipelining?
I’m very far from an expert, but I’m pretty sure LLVM cannot delete a safepoint load. It’s implemented as a volatile load and separated from the rest of the function by a sequentially consistent (seq_cst) memory barrier both before and after, so the load cannot be deleted, moved, or coalesced by the compiler. Intentionally so, because it would be very bad to ‘optimize away’ a ‘dead’ safepoint load.
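For illustration, here is a rough C analogue of that pattern (a sketch of the idea only, not Julia’s actual runtime code: gc_page here is an ordinary global standing in for the page the runtime would protect to trigger a fault at GC time, and add_with_safepoints is a made-up function mirroring the adds benchmark above):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the runtime's safepoint page. In a real runtime this page
 * would be mprotect()ed when a GC is requested, so the load below faults. */
static uintptr_t gc_page;

/* A volatile load bracketed by seq_cst fences: the volatile qualifier
 * forbids deleting or coalescing the load, and the fences forbid moving
 * other memory operations across it. */
static inline void safepoint(void) {
    atomic_thread_fence(memory_order_seq_cst);
    (void)*(volatile uintptr_t *)&gc_page;
    atomic_thread_fence(memory_order_seq_cst);
}

/* Two back-to-back safepoints around trivial work, mirroring the adds
 * benchmark: even though they "do the same thing", neither load may be
 * dropped by the optimizer. */
static int add_with_safepoints(int x, int y) {
    safepoint();
    int r = x + y;
    safepoint();
    return r;
}
```

Compiling this with -O3 and inspecting the assembly shows both loads survive, which matches the repeated lines seen in @code_native above.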
Ok, but C doesn’t pay the “GC book-keeping for multi-threaded runtime” cost.
That’s true for 1 function, but the real workload is closer to “function calls function calls function…”.
I don’t think it’s possible to “patch” compiled code to remove safepoints, at least not easily. Instead, there should be a compilation flag gc_safepoints=false.
They’re not a problem, in the sense that “technically, it still runs”.
They’re a “problem” because instead of having 10 ms of real work in your process() function, you now get only 8 ms of it, with the remaining 2 ms going to book-keeping.
(I’m taking the 20% number from the microbenchmark, but obviously what matters is the real workload. It could be higher or lower. Without a way to compile real work with gc_safepoints=false, there isn’t even a way to quantify the slowdown!)
You raise a good point that this was added before juliac was a thing. The performance regression was less relevant back then.
It’s true for any number of nested functions as long as the work per function call is not tiny. For example, if it’s recursive, then the base case needs to be large enough to amortize the function-call overhead.
Otherwise you are sunk without inlining, safepoint or no safepoint.
The compiler could still emit two versions of assembly code, one for single-threaded runs and the other for multi-threaded runs, if I understand the discussion correctly. But I guess Julia chooses not to hyper-optimize corner cases like this.
One problem with the compiler toggle is that we’d have to duplicate or repeat precompile workloads: one for the single-threaded no-safepoint version and one for the multi-threaded safepoint version. And there’s already community concern about precompilation times growing.
Something we haven’t mentioned is that garbage collection is possible without safepoints. However, without certainty about GC roots (and I still don’t know how this is synchronized across threads), GCs have to be conservative (like the Boehm-Demers-Weiser GC for C, C++, C# on Mono, etc.), which delays the identification of garbage and makes some approaches infeasible (copying) or limited (generational, which Julia uses). It might not be a good tradeoff to eliminate safepoints if that makes the GC slower or limits our options in a potential GC-swapping future. The Boehm GC does have an incremental mode, though, which is relevant to allocating real-time contexts like the Unity game engine.
Another interesting thing I’m starting to notice is that Rust GC crates also distinguish thread-local and thread-safe implementations, e.g. dumpster mentioning synchronization overheads. That’s also about sharing memory among threads, though, so I’m not sure if there’s any parallel to Julia (task_local_storage sounds like it’s about tasks, which can migrate threads).