Elimination of (unnecessary) runtime dispatch allocations

Avoiding runtime dispatch is perhaps the #1 optimization advice in Julia forums.
As I understand it, the reason is that when Julia performs a runtime dispatch, the compiler does not necessarily know the return type of the function being called at compile time. So it needs to allocate space for the return value on the heap (the value is “boxed”).

Hence, the common advice is to avoid runtime dispatch. However, this is not always possible - probably the most frequent example is processing a vector of heterogeneous elements. E.g. a GUI library drawing a list of screen components, a ray tracer processing a list of light sources, a financial library pricing a list of financial instruments, etc…

A minimal working example is this:

using BenchmarkTools
using Test
using JET

abstract type AbstractFoo end
struct Foo1 <: AbstractFoo end
struct Foo2 <: AbstractFoo end
struct Foo3 <: AbstractFoo end
struct Foo4 <: AbstractFoo end
struct Foo5 <: AbstractFoo end

f(::Foo1) = rand()
f(::Foo2) = rand()
f(::Foo3) = rand()
f(::Foo4) = rand()
f(::Foo5) = rand()

g(arr) = f(first(arr))
foos = AbstractFoo[Foo1()]

@btime g($foos) # 25.978 ns (1 allocation: 16 bytes)
@inferred g(foos) # return type Float64 does not match inferred return type Any
@report_opt g(foos) # runtime dispatch detected: f(%1::AbstractFoo)::Any

In the above example, f always returns a Float64 and there is a matching method of f for every possible instance of AbstractFoo; yet the compiler prepares for any return type, leading to an (unnecessary) allocation.
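For completeness, inference’s view of the call site can be queried directly. This is just an illustrative one-liner using the Base.return_types introspection helper; the exact output may vary by Julia version:

# Every method of f returns Float64, yet at this call site inference
# cannot narrow the return type and falls back to Any:
Base.return_types(g, (Vector{AbstractFoo},)) # expected: Any[Any]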

I’m not here to ask for specific compiler optimizations - but if I were - this would be at the top of my list. IMHO Julia can do (almost?) anything C++/Rust can do - but easier and just as fast - except for this! And this is a very common (and very valid) programming paradigm.

This topic has been raised before (see related), but I hope that renewed attention may be beneficial or surface further information. My current capacity only allows for shedding light on the issue - not solving it…

And please note, I’m not looking for workarounds like enums, sumtypes, ManualDispatch, TypeSortedCollections, etc… Been there, done that with varying levels of success…

Related:

3 Likes

Here’s a more relevant issue:

In short, there’s a tradeoff. These are method-table optimizations: they depend upon the state of the method table. And thus, when you add more methods, you end up with method-table invalidations. And generally, when a function already has lots of methods, more are likely to be defined later.

So, yes, you can increase max_methods and rebuild Julia if you want. But you’ll pay for it in other places — namely, higher compilation times and more invalidations.
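To illustrate that max_methods really is the knob here, consider a hedged variant of the original example (the AbstractBar/h/g2 names are made up for this sketch, and it assumes the default max_methods limit of 3 on recent Julia versions, plus the BenchmarkTools/Test imports from the first post): with no more subtypes than the limit, inference can enumerate all matching methods, the Float64 return type is known, and the allocation disappears.

abstract type AbstractBar end
struct Bar1 <: AbstractBar end
struct Bar2 <: AbstractBar end
struct Bar3 <: AbstractBar end

h(::Bar1) = rand()
h(::Bar2) = rand()
h(::Bar3) = rand()

g2(arr) = h(first(arr))
bars = AbstractBar[Bar1()]

@inferred g2(bars) # passes: only three matching methods, so Float64 is inferred
@btime g2($bars)   # expected: 0 allocations (the return value is no longer boxed)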

2 Likes

Thx Matt - yes, I know the invalidation argument - and I know the Julia community has spent probably hundreds of man-years in a combined effort to reduce invalidations and reduce TTFX - to great success!

And reducing invalidations/TTFX is good - but in this case it results in suboptimal “production” code. That is, it favors fast REPLs over shipped production applications built with PackageCompiler. At the risk of derailing the discussion, my take is that a huge effort has been put into optimizing Julia for interactive use - and less so for “production” use?

But regarding your suggestion: instead of increasing max_methods, perhaps one way could be to tell the compiler to “give up” only once the number of potential return types at a call site exceeds a specific threshold (e.g. keep inferring as long as there are <= 2 candidate return types)?

The answer is that it’s a tradeoff. See the table in the linked issue evaluating different algorithms and the tradeoffs therein. Any choice obviously has different winners and losers, and different workloads will win or lose under any given choice. Note that with max_methods=10, Julia takes so long to build that it effectively never finishes building.

I wasn’t seriously suggesting increasing max_methods in a fork of Julia; it’s just the knob that’s responsible for this behavior. Perhaps max_methods could be higher under --compile=all? :person_shrugging:

Fixing the boxing of immutable isbits arguments during dynamic dispatch (mutable arguments are already boxed) seems feasible. As the PR linked from the GH issue you shared shows, though, it would require major internal plumbing work. I suspect you could count the people in this community able to do such work on one hand, but it would be nice to see.

Fixing the boxing of return values feels much harder, because Julia lacks function-level return type declarations like C++/Rust/Swift/etc. max_methods can work by brute-force devirtualizing the dispatch tree, but that’s like using a screwdriver to hammer in a nail. IMO it’s less a problem of focusing on “interactivity” vs “production” than a novel language-design problem (nobody has figured out how to combine declared function return types + multiple dispatch + a type system as complex as Julia’s, AFAIK).

What gives me hope is the ongoing work for compiled binary generation. If that effort explores techniques such as sealed method tables, the experience gained could drive future work on dynamic dispatch stability.

Note that there’s an escape hatch for changing max_methods if the code requires: julia/base/experimental.jl at 96866cb8f5c8f28d96c2b9e4eb1ec4f3a00a705b · JuliaLang/julia · GitHub
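For reference, here is a hedged sketch of how that escape hatch might be applied to the original example (reusing the Foo types, f, g, and foos from the first post; the value 5 is my choice to cover the five methods - check the docstring of Base.Experimental.@max_methods for the accepted range):

using Base.Experimental: @max_methods

# Forward-declare the generic function with a raised per-function inference
# limit, then add the methods as before:
@max_methods 5 function f end

f(::Foo1) = rand()
f(::Foo2) = rand()
f(::Foo3) = rand()
f(::Foo4) = rand()
f(::Foo5) = rand()

@inferred g(foos)  # expected to pass with return type Float64
@btime g($foos)    # expected: 0 allocations once all five methods are considered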

9 Likes

:exploding_head:

What would it take for @max_methods(n::Int, fdef::Expr) to move out of the Experimental module?

1 Like

This is absolutely brilliant! Using this, the allocation disappears and the @btime time drops by 75%. This will solve a lot of my worries - if not all of them. Thank you Keno!!

1 Like