Why did the doubly recursive fibonacci microbenchmark become relatively slower?

C++ and Rust can also run multi-threaded, and they don’t pay this cost. There should be a toggle, in the vein of Debug/Release builds: in dev/Debug mode, code may run with all the protections, sanitizers, borrow checkers, etc.; in Release mode, it should have true zero-cost abstractions.

It’s perfectly fine for juliac to be this toggle, which is why I asked whether it works today.

First, it’s not about the specific cost; it’s about having one in the first place. I’d rather my apps be doing actual number-crunching than bookkeeping. Second, it’s the kind of thing that can sneak up on you and tank your performance. App developers don’t get control over inlining (other than hinting with @inline or LLVM intrinsics?), so they have to inspect the compiler’s output every time. Third, yes, I agree that ideally embedded leaf code should inline, but it doesn’t always. (Embedded in this context means running on a standard desktop machine, by the way.) Fourth, we’ve already seen a microbenchmark being impacted; it’s not hard to imagine a similar real workload with non-recursive function calls. Finally, I have to stress again that the runtime isn’t needed in this context, so the cost is paid “for nothing”.
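
To make the “inspect the output every time” point concrete, here is a minimal sketch of what that check looks like. The names `addone`, `addone_call`, `caller_inlined` and `caller_call` are made up for illustration, and `fib` is just my guess at the shape of the doubly recursive microbenchmark, not the exact code that was measured. It only shows whether a call survived inlining; it doesn’t measure the cost itself.

```julia
using InteractiveUtils  # provides @code_typed when running outside the REPL

# Hypothetical leaf function; @inline is only a hint to the compiler.
@inline addone(x::Int) = x + 1

# Same body, but explicitly kept as a real call for comparison.
@noinline addone_call(x::Int) = x + 1

caller_inlined(x::Int) = addone(x) * 2
caller_call(x::Int) = addone_call(x) * 2

# Doubly recursive Fibonacci (assumed shape of the microbenchmark).
fib(n::Int) = n < 2 ? n : fib(n - 1) + fib(n - 2)

# Print the optimized typed IR for each caller. A call that was not
# inlined typically shows up as an `invoke` statement in this output.
println(@code_typed caller_inlined(1))
println(@code_typed caller_call(1))
```

In my experience the second listing keeps an `invoke` of `addone_call` while the first is reduced to plain integer arithmetic, but the exact IR differs between Julia versions, which is exactly why you end up re-checking it.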