Elimination of (unnecessary) runtime dispatch allocations

Avoiding runtime dispatch is perhaps the #1 optimization advice in Julia forums.
As I understand it, the reason is that when Julia performs a runtime dispatch, the compiler does not necessarily know the return type of the function being called at compile time. So it needs to allocate space for the return value on the heap (the value is “boxed”).

Hence, the common advice is to avoid runtime dispatch. However, this is not always possible - probably the most frequent example is processing a vector of heterogeneous elements. E.g. a GUI library drawing a list of screen components, a ray tracer processing a list of light sources, a financial library pricing a list of financial instruments, etc…

A minimal working example is this:

using BenchmarkTools
using Test
using JET

abstract type AbstractFoo end
struct Foo1 <: AbstractFoo end
struct Foo2 <: AbstractFoo end
struct Foo3 <: AbstractFoo end
struct Foo4 <: AbstractFoo end
struct Foo5 <: AbstractFoo end

f(::Foo1) = rand()
f(::Foo2) = rand()
f(::Foo3) = rand()
f(::Foo4) = rand()
f(::Foo5) = rand()

g(arr) = f(first(arr))
foos = AbstractFoo[Foo1()]

@btime g($foos) # 25.978 ns (1 allocation: 16 bytes)
@inferred g(foos) # return type Float64 does not match inferred return type Any
@report_opt g(foos) # runtime dispatch detected: f(%1::AbstractFoo)::Any

In the above example, f always returns a Float64 and there is a matching method of f for every possible instance of AbstractFoo; yet the compiler prepares for any return type, leading to an (unnecessary) allocation.
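For completeness, inference’s view of the call site can be queried directly. This is just an illustrative one-liner using the Base.return_types introspection helper; the exact output may vary by Julia version:

# Every method of f returns Float64, yet at this call site inference
# cannot narrow the return type and falls back to Any:
Base.return_types(g, (Vector{AbstractFoo},)) # expected: Any[Any]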

I’m not here to ask for specific compiler optimizations - but if I were - this would be at the top of my list. IMHO Julia can do (almost?) anything C++/Rust can do - but easier and just as fast - except for this! And this is a very common (and very valid) programming paradigm.

This topic has been raised before (see related), but I hope that renewed attention may be beneficial or surface further information. My current capacity only allows for shedding light on the issue - not solving it…

And please note, I’m not looking for workarounds like enums, sumtypes, ManualDispatch, TypeSortedCollections, etc… Been there, done that with varying levels of success…

Related:

3 Likes

Here’s a more relevant issue:

In short, there’s a tradeoff. These are method-table optimizations: they depend upon the state of the method table. And thus, when you add more methods, you end up with method-table invalidations. And generally, when a function already has lots of methods, more are likely to be defined later.

So, yes, you can increase max_methods and rebuild Julia if you want. But you’ll pay for it in other places — namely, higher compilation times and more invalidations.
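To illustrate that max_methods really is the knob here, consider a hedged variant of the original example (the AbstractBar/h/g2 names are made up for this sketch, and it assumes the default max_methods limit of 3 on recent Julia versions, plus the BenchmarkTools/Test imports from the first post): with no more subtypes than the limit, inference can enumerate all matching methods, the Float64 return type is known, and the allocation disappears.

abstract type AbstractBar end
struct Bar1 <: AbstractBar end
struct Bar2 <: AbstractBar end
struct Bar3 <: AbstractBar end

h(::Bar1) = rand()
h(::Bar2) = rand()
h(::Bar3) = rand()

g2(arr) = h(first(arr))
bars = AbstractBar[Bar1()]

@inferred g2(bars) # passes: only three matching methods, so Float64 is inferred
@btime g2($bars)   # expected: 0 allocations (the return value is no longer boxed)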

2 Likes

Thx Matt - yes, I know the invalidation argument - and I know the Julia community has spent probably hundreds of man-years in a combined effort to reduce invalidations and reduce TTFX - to great success!

And reducing invalidations/TTFX is good - but in this case it results in suboptimal “production” code. That is, it favors fast REPLs over shipped production applications built with PackageCompiler. At the risk of derailing the discussion, my take is that a huge effort has been put into optimizing Julia for interactive use - and less so for “production” use?

But regarding your suggestion: instead of increasing max_methods, perhaps one way could be to tell the compiler to “give up” only once the number of potential return types at a call site exceeds a specific threshold (e.g. keep inferring as long as there are <= 2 candidate return types)?

The answer is that it’s a tradeoff. See the table in the linked issue evaluating different algorithms and the tradeoffs therein. Any choice obviously has different winners and losers, and different workloads will win or lose under any given choice. Note that with max_methods=10, Julia takes so long to build that it effectively never finishes building.

I wasn’t seriously suggesting increasing max_methods in a fork of Julia; it’s just the knob that’s responsible for this behavior. Perhaps max_methods could be higher under --compile=all? :person_shrugging:

Fixing the boxing of immutable isbits arguments during dynamic dispatch (mutable arguments are already boxed) seems feasible. As the PR linked from the GH issue you shared shows, though, it would require major internal plumbing work. I suspect you could count the people in this community able to do such work on one hand, but it would be nice to see.

Fixing the boxing of return values feels much harder, because Julia lacks function-level return type declarations like C++/Rust/Swift/etc. max_methods can work by brute-force devirtualizing the dispatch tree, but that’s like using a screwdriver to hammer in a nail. IMO it’s less a problem of focusing on “interactivity” vs “production” than a novel language-design problem (nobody has figured out how to combine declared function return types + multiple dispatch + a type system as complex as Julia’s, AFAIK).

What gives me hope is the ongoing work for compiled binary generation. If that effort explores techniques such as sealed method tables, the experience gained could drive future work on dynamic dispatch stability.

Note that there’s an escape hatch for changing max_methods if the code requires: julia/base/experimental.jl at 96866cb8f5c8f28d96c2b9e4eb1ec4f3a00a705b · JuliaLang/julia · GitHub
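For reference, here is a hedged sketch of how that escape hatch might be applied to the original example (reusing the Foo types, f, g, and foos from the first post; the value 5 is my choice to cover the five methods - check the docstring of Base.Experimental.@max_methods for the accepted range):

using Base.Experimental: @max_methods

# Forward-declare the generic function with a raised per-function inference
# limit, then add the methods as before:
@max_methods 5 function f end

f(::Foo1) = rand()
f(::Foo2) = rand()
f(::Foo3) = rand()
f(::Foo4) = rand()
f(::Foo5) = rand()

@inferred g(foos)  # expected to pass with return type Float64
@btime g($foos)    # expected: 0 allocations once all five methods are considered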

9 Likes

:exploding_head:

What would it take for @max_methods(n::Int, fdef::Expr) to move out of the Experimental module?

1 Like

This is absolutely brilliant! Using this, the allocation disappears and the @btime time drops by 75%. This will solve a lot of my worries - if not all of them. Thank you Keno!!

1 Like