How slow is runtime dispatch, anyway? Benchmark attempts

It’s not actually easy to separate the timing of the runtime dispatches from the surrounding code, and I wasn’t much convinced by the few benchmarks people have tried, even counting only the ones that did control the number of dispatches. I can’t just time jl_apply_generic either, because a runtime dispatch also involves preparing the call.

Making the type and the 6×14×14 = 1176 methods:
# Julia v1.11.2
struct Call2arg{F,X,Y}
  f::F
  x::X
  y::Y
end

@inline (c::Call2arg)() = c.f(c.x, c.y)

# contained source of poor type inference
const Call2num = Call2arg{Function, Number, Number} 

funcs, inputs = let
  funcnames = Symbol.(collect("qwerty"))
  numtypes = [Bool, Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64,
              UInt64, Int128, UInt128, Float16, Float32, Float64]
  methodcount=0
  for func in funcnames
    for type1 in numtypes
      for type2 in numtypes
        methodcount += 1
        @eval $func(::$(Symbol(type1)), ::$(Symbol(type2))) = $methodcount
      end
    end
  end
  @show methodcount
  eval.(funcnames), one.(numtypes)
end

Attempt 1: simple benchmark loop. The @code_llvm reports suggest the type-unstable call just loads up a call frame and calls jl_apply_generic, so I’m fairly confident the only extraneous code is the calls themselves, which all just return an Int.
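(For reference, those reports came from inspections along these lines; output omitted here:)

julia> @code_llvm Call2num(r, 1, 2.2)()   # shows the call-frame setup and the jl_apply_generic call

julia> @code_llvm Call2arg(r, 1, 2.2)()   # fully inferred, no dispatch machinery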

julia> using BenchmarkTools

julia> @btime $(Call2num(r, 1, 2.2))() # loop of runtime dispatch
  11.311 ns (0 allocations: 0 bytes)
700

julia> @btime $(Call2arg(r, 1, 2.2))(); # loop of loading `Int`
  1.000 ns (0 allocations: 0 bytes)

which suggests runtime dispatch adds ~10ns on my system.

Attempt 2: I tried to use a setup and evals=1 to randomize the call before each timing, but the weird results reminded me that a low number of evaluations per timed sample cannot accurately estimate such short timings.
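For the record, the shape of what I tried was roughly this (results omitted since they weren’t usable):

julia> @btime c() setup=(c = Call2num(rand(funcs), rand(inputs), rand(inputs))) evals=1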

Attempt 3: Timed a loop of randomized calls so it lasts long enough for decent timing. I moved as much of the randomization work outside the benchmark as possible.

julia> function loopcalls(randcalls)
         local x
         for i in eachindex(randcalls)  x = randcalls[i]()  end;
         x
       end;

julia> randcalls = let n=1_000_000
          Call2num.(rand(funcs,n), rand(inputs,n), rand(inputs,n))
       end;

julia> @btime loopcalls($randcalls)
  179.658 ms (0 allocations: 0 bytes)
258

It was hard to write a runtime-dispatch-less version that the compiler didn’t optimize away, so I settled for a non-equivalent loop that just loads an Int64 field.

julia> function loopblah(randcalls)
         local x
         for i in eachindex(randcalls)  x = randcalls[i].x  end
         x
       end;

julia> randcalls = let n=1_000_000
          Call2arg.(r, rand(Int64,n), rand(Float64,n))
       end;

julia> @btime loopblah($randcalls)
  32.100 μs (0 allocations: 0 bytes)
193604991833945415

The @code_llvm printout diverged much more, so I’m not sure this isolates the iteration overhead, but assuming it really is this negligible, randomized runtime dispatch of these methods adds ~180ns per call. To check that the randomization is responsible for the jump from ~10ns to ~180ns, the 3rd benchmark attempt can be set up to resemble the 1st:

julia> randcalls = let n=1_000_000
          Call2num.(Ref(r), rand(Int,n), rand(Float64,n))
       end;

julia> @btime loopcalls($randcalls)
  11.154 ms (0 allocations: 0 bytes)
700

and we get roughly the ~10ns-per-call performance again. I don’t actually know why this happens for the same code and method tables, but CPU caching has been suggested. However, omitting some functions and inputs from the random sampling (same million calls) seems to decrease the runtime smoothly, and I don’t see how caches alone would explain that.
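To be concrete, by omitting I mean restricting the random sampling to subsets, e.g. (the exact subsets here are arbitrary):

julia> randcalls = let n=1_000_000   # only 2 functions and 3 input types
          Call2num.(rand(funcs[1:2],n), rand(inputs[1:3],n), rand(inputs[1:3],n))
       end;

julia> @btime loopcalls($randcalls)   # lands between the ~11ms and ~180ms results above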

You probably saw this coming, but the motivation was a comparison with CPython’s single dispatch. You might have heard about CPython’s experimental JIT compiler, but the official binaries aren’t built with it, so all the dispatches here happen at runtime.

>>> class A: # v3.13.0
...     def foo(self): return 1
...
>>> a = A()
>>> import timeit
>>> timeit.timeit("a.foo()", globals=globals(), number = 1_000_000) # seconds
0.03291300009004772

which adds ~33ns, unrandomized.
Python doesn’t have numeric primitives or metaprogramming like Julia’s, I can’t make an exact equivalent of the 2-argument dispatch, and there’s no compiler to optimize the attribute accesses, but I tried to generate a comparable number of heavily monkeypatched classes to make up for it:
funcnames = "qwerty"
halfclasses = "ZXCVBNMASDFGHJ"

methodcount = 0
inputs = []
for halfclass1 in halfclasses:
  for halfclass2 in halfclasses:
    exec(f"""
class {halfclass1}{halfclass2}: pass
""")
    inputs.append(eval(f"{halfclass1}{halfclass2}")())
    for func in funcnames:
      methodcount += 1
      exec(f"""
def {func}(self): return {methodcount}
{halfclass1}{halfclass2}.{func} = {func}
""")

import random

n = 1_000_000
randinputs = random.choices(inputs, k=n)
randfuncs = random.choices(funcnames, k=n)

def loopcalls(randfuncs, randinputs):
  for i in range(len(randfuncs)):
    x = getattr(randinputs[i], randfuncs[i])()
  return x

def loopblah(randfuncs, randinputs):
  for i in range(len(randfuncs)):
    x = 1
  return x

import timeit
>>> timeit.timeit("loopcalls(randfuncs,randinputs)", globals=globals(), number = 1)
0.16309330007061362

>>> timeit.timeit("loopblah(randfuncs,randinputs)", globals=globals(), number = 1)
0.018459899933077395

which adds ~145ns per call, randomized. When I omit methods and classes, the runtime decreases too, though going to the extreme only gets down to an added ~64ns; it doesn’t match the single-call benchmark’s ~33ns.

>>> randinputs = random.choices(inputs[:1], k=n)
>>> randfuncs = random.choices(funcnames[:1], k=n)
>>> timeit.timeit("loopcalls(randfuncs,randinputs)", globals=globals(), number = 1)
0.09056359995156527
>>> timeit.timeit("loopblah(randfuncs,randinputs)", globals=globals(), number = 1)
0.026401199982501566

So while you’d expect single dispatch to be more efficient than multiple dispatch, Python’s dynamism requires a costlier implementation. This 1176-method benchmark doesn’t show a dramatic difference (33–145ns vs 10–180ns), but the number of recently called methods clearly affects the runtime, and for all I know the two could scale very differently over that or other factors.

2 Likes

While it’s interesting to see how much overhead dynamic method lookup and calling has in Julia, it’s only part of the reason dynamic dispatch is slower. Another important one that’s come up multiple times in practice is the overhead from boxing and unboxing arguments and return values that would otherwise be stack-allocated. `hash(::Vector{<abstract type>}, ::UInt)` allocates quite a bit due to type instability: why can't hash(::Any) infer `UInt` return type? · Issue #44244 · JuliaLang/julia · GitHub and [WIP] dynamic dispatch optimization: avoid boxing stack-allocated inputs by NHDaly · Pull Request #50136 · JuliaLang/julia · GitHub talk about this, and there should be related discussion threads on Discourse too.
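A quick way to see that at the REPL (any vector with a non-concrete element type will do; exact numbers will vary):

julia> using BenchmarkTools

julia> v = Number[1, 2.0, 0x03];   # non-concrete element type

julia> @btime hash($v, zero(UInt));   # reports allocations because each element's hash call is dispatched at runtime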

Edit: this previous thread has a similar benchmark and a comparison against Rust: Curious about the internals of dynamic dispatch - #13 by HashBrown

Suffice it to say that a benchmark that captures the totality of runtime dispatch overhead should consider these other factors, regardless of whether the purpose is to compare against other languages.

2 Likes

Oops, it looks like the compiler is now smart enough to world-split the dynamic dispatch here. I had to add a couple more methods to restore the dynamic dispatch:

using BenchmarkTools
using Random: Random

abstract type A end

struct B <: A
    b::Int64
end

struct C <: A
    c::Int64
end

struct D <: A
    d::Int64
end

struct E <: A
    e::Int64
end

get_val(b::B) = b.b
get_val(c::C) = c.c
get_val(d::D) = d.d
get_val(e::E) = e.e

function dynamic_dispatch(v::Vector{A})
    sum = 0
    for a in v
        sum += get_val(a)::Int64
    end
    sum
end

function static_dispatch(v::Vector{B})
    sum = 0
    for a in v
        sum += get_val(a)::Int64
    end
    sum
end

function main()
    Random.seed!(9001)
    v1 = Vector{A}()
    v2 = Vector{B}()
    for _ in range(1,1_000_000)
        x = rand([1, 2])
        if x === 1
            push!(v1, B(1))
            push!(v2, B(1))
        else
            push!(v1, C(2))
            push!(v2, B(2))
        end
    end
    println("Dynamic Dispatch:")
    @btime dynamic_dispatch($v1)
    println("Static Dispatch")
    @btime static_dispatch($v2)
end

main()

Performance:

Dynamic Dispatch:
  16.615 ms (0 allocations: 0 bytes)
Static Dispatch
  190.869 μs (0 allocations: 0 bytes)

And with a slight modification to show the impact of argument boxing:

# ... type and struct definitions as above

get_val(b::B, t::NTuple{10, Int}) = b.b + t[1]
get_val(c::C, t::NTuple{10, Int}) = c.c + t[1]
get_val(d::D, t::NTuple{10, Int}) = d.d + t[1]
get_val(e::E, t::NTuple{10, Int}) = e.e + t[1]

function dynamic_dispatch(v::Vector{A}, t::NTuple{10, Int})
    sum = 0
    for a in v
        sum += get_val(a, t)::Int64
    end
    sum
end

function static_dispatch(v::Vector{B}, t::NTuple{10, Int})
    sum = 0
    for a in v
        sum += get_val(a, t)::Int64
    end
    sum
end

function main()
    Random.seed!(9001)
    v1 = Vector{A}()
    v2 = Vector{B}()
    t = ntuple(x -> x * 10_000, 10)
    for _ in range(1,1_000_000)
        x = rand([1, 2])
        if x === 1
            push!(v1, B(1))
            push!(v2, B(1))
        else
            push!(v1, C(2))
            push!(v2, B(2))
        end
    end
    println("Dynamic Dispatch:")
    @btime dynamic_dispatch($v1, $t)
    println("Static Dispatch")
    @btime static_dispatch($v2, $t)
end

main()

Performance:

Dynamic Dispatch:
  30.772 ms (2000000 allocations: 106.81 MiB)
Static Dispatch
  198.591 μs (0 allocations: 0 bytes)

The corresponding Rust code:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::{Rng, SeedableRng};
use rand::rngs::SmallRng;

trait A {
    fn get_val(&self, t: [i64; 10]) -> i64;
}

struct B {
    b: i64,
}

struct C {
    c: i64,
}

struct D {
    d: i64,
}

struct E {
    e: i64,
}

impl A for B {
    fn get_val(&self, t: [i64; 10]) -> i64 {
        self.b + t[0]
    }
}

impl A for C {
    fn get_val(&self, t: [i64; 10]) -> i64 {
        self.c + t[0]
    }
}

impl A for D {
    fn get_val(&self, t: [i64; 10]) -> i64 {
        self.d + t[0]
    }
}

impl A for E {
    fn get_val(&self, t: [i64; 10]) -> i64 {
        self.e + t[0]
    }
}

fn dynamic_dispatch(v: &Vec<Box<dyn A>>, t: [i64; 10]) -> i64 {
    let mut sum = 0;
    for ele in v {
        sum += ele.get_val(t);
    }
    sum
}

fn static_dispatch(v: &Vec<B>, t: [i64; 10]) -> i64 {
    let mut sum = 0;
    for ele in v {
        sum += ele.get_val(t);
    }
    sum
}

fn criterion_benchmark(c: &mut Criterion) {
    let mut v1 = Vec::<Box<dyn A>>::new();
    let mut v2 = Vec::<B>::new();
    let t: [i64; 10] = core::array::from_fn(|i| i as i64 * 10_000);

    let mut rng = SmallRng::seed_from_u64(9001);
    for _ in 0..1_000_000 {
        if rng.gen() {
            v1.push(Box::new(B { b: 1 }));
            v2.push(B { b: 1 });
        } else {
            v1.push(Box::new(C { c: 2 }));
            v2.push(B { b: 2 });
        }
    }

    c.bench_function("dynamic dispatch", |b| b.iter(|| {
        dynamic_dispatch(black_box(&v1), black_box(t))
    }));

    c.bench_function("static dispatch", |b| b.iter(|| {
        static_dispatch(black_box(&v2), black_box(t))
    }));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

And performance results from criterion:

dynamic dispatch        time:   [3.3195 ms 3.3268 ms 3.3355 ms]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

static dispatch         time:   [262.16 µs 264.59 µs 267.58 µs]
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe
2 Likes

The exact issue was that such benchmarks can’t show the contribution of each of those factors or be well controlled, especially with the compiler eliding work as it pleases. In other words, runtime dispatch defined as all the possible extra work on top of static dispatch is not one thing to be benchmarked. Note that your link to HashBrown’s benchmark also reports 0 allocations; even then, the compilers will do anything they know to save work. HashBrown demonstrated that Rust was able to recognize ahead-of-time that all the types had the same size and optimize accordingly; you could do the same in your updated benchmark and see how the compiler handles it. That’s why I tried hard to isolate the call-frame loading and the method dispatch, the bare minimum of runtime dispatch that is always there, and to remove as much other code as possible from the benchmark.

As for benchmarking allocations, I think that would be better done separately from runtime dispatch. But its usefulness is as questionable as that of my runtime benchmark’s large range: heap allocations are notably non-deterministic, and the GC matters in practice too. Put another way, performance that depends on murky conditions means this isn’t one thing to benchmark either. Benchmarks of specific code like the unstable hash are what’s useful, but that code does too many other things to say anything about runtime dispatch or allocations in particular.

It’s fine to benchmark one component of runtime dispatch to try for as precise an estimate as possible. It just offers an incomplete picture of the actual impact users will experience with dynamic dispatch in practice.

jl_apply_generic requires any arguments that aren’t already represented as jl_value_t * (i.e. boxed) to be represented as one. The calling convention requires boxing, and that means any apples-to-apples comparison of static to dynamic dispatch needs to include a scenario which captures this.[1]
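A minimal sketch of that boxing, with the callee hidden behind a Ref{Any} so the compiler can’t resolve the call statically (the names hidden_op and add_dyn are made up for illustration; the allocation report is the point):

julia> const hidden_op = Ref{Any}(+);

julia> add_dyn(x, y) = hidden_op[](x, y);

julia> @btime add_dyn(1.5, 2.5);   # the isbits Float64 arguments (and the result) each get boxed on the way through jl_apply_generic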

To address individual points:

This seems incorrect. They found that Rust and Julia had around the same performance, which makes sense given both languages do similar things. Box in Rust erases the shape information, which is also what jl_value_t * (the effective eltype storage of an Any[]) does in Julia.

What that benchmark shows (and yours confirms) is that the method lookup and call parts of Julia’s dynamic dispatch machinery are pretty competitive. That’s great!

Again, there is close coupling between heap allocation and dynamic dispatch that cannot be decoupled if we want to quantify the real-world impact of the latter. Idiomatic Julia rightfully encourages us to use immutable value types whenever possible, and those are almost always boxed with runtime dispatch. If anything, having a previously allocated Any array or array with a non-concrete element type helps eliminate GC variability, but our benchmarking tools have been designed to account for that too. Dismissing any benchmark which allocates as invalid is not productive.


  1. This is also why benchmarking against Python is not as interesting as benchmarking against compiled languages with a virtual/non-virtual method distinction. In Python, almost everything is already boxed, and you can’t separate out the impact of boxing from the rest of the noise. In contrast, languages like C++/Rust/C# which use vtables can get away with no implicit boxing. Part of this is because statically-typed single dispatch is just easier implementation-wise than Julia’s multiple dispatch under the hood (hence the mention of the calling convention).

So does that mean the 0-allocation results are because the abstractly typed fields in Call2num did the boxing in advance? A simple edit in that direction (which I think changes which call is dispatched at runtime) at least produces the allocations, though the timing doesn’t change for me as much as in your benchmark:

julia> randcalls = let n=1_000_000 # Vector{Call2arg{F,X,Y} where {F,X,Y}}
         Call2arg.(rand(funcs,n), rand(inputs[2:end-1],n), rand(inputs[2:end-1],n))
       end;

julia> @btime loopcalls($randcalls)
  251.094 ms (666468 allocations: 15.26 MiB)
234

We’re not in disagreement here, and I’m not saying that any benchmarks mentioned so far are invalid. In fact, I asserted the opposite about benchmarking real code, and that’s in line with what many have cautioned about microbenchmarks. I am saying that they’re not measuring the specific irreducible action I’m trying to measure.

However, I would disagree that my benchmark has no real-world relevance. Simply switching from broadcasting Call2arg to Call2num can save work in subsequent runtime-dispatched code, akin to preallocation. Despite how brittle those savings are in type-unstable code, I’d say that’s still practical advice. Interestingly, this suggests CPython’s disposition to put every object and field on the heap helps its interpreted performance in a way, the opposite of what’s usually talked about. This has an even bigger effect on my attempt at a loop-overhead benchmark, to the point that it’s not measuring loop overhead anymore:

julia> @btime loopblah($randcalls)
  32.461 ms (332931 allocations: 7.62 MiB)
0x01

julia> randcalls = let n=1_000_000 # Vector{Call2num}
         Call2num.(rand(funcs,n), rand(inputs[2:end-1],n), rand(inputs[2:end-1],n))
       end;

julia> @btime loopblah($randcalls)
  1.060 ms (0 allocations: 0 bytes)
1

That was HashBrown’s hypothesis; do you have another explanation for why HashBrown’s Rust benchmark worsened after the targeted edit to one of the types’ sizes? (I’m assuming the improvement in the static dispatch code over the edited type has completely different reasons, like SIMD or something, but that’s beside the topic of runtime dispatch.)

I think so. The timing discrepancy could be explained by the size of the non-func inputs. Roughly 2x8 bytes for Call2num vs 10x8 bytes for the NTuple.
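(A quick sanity check of those sizes on a 64-bit machine:)

julia> sizeof(Int) + sizeof(Float64), sizeof(NTuple{10, Int})
(16, 80)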

Yes, and I agree it’s useful because we don’t have a lot of advice out there about this (cf. all the advice about function argument despecialization).

That said, my experience from previous discussion threads with real-world examples is that almost all of them involved boxing. For example, it’s just a really easy thing to run into the moment you start combining otherwise well-typed values with recursive ADTs. Most threads benefitted from returning to static dispatch using a sum types library, but that doesn’t change the fact that runtime dispatch is slower.

You’re right, I misunderstood their original edit. Here’s how things look with three extra 8-byte fields on struct C:

dynamic dispatch        time:   [5.5245 ms 5.5315 ms 5.5427 ms]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

static dispatch         time:   [248.60 µs 250.70 µs 253.19 µs]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

However, it does not appear these extra fields change any of the codegen of the dynamic_dispatch function. Looking at the generated LLVM on Godbolt, the important part remains the same:

...
bb5:
  %ele = load ptr, ptr %_5, align 8, !dbg !288
  %_12.0 = load ptr, ptr %ele, align 8, !dbg !289
  %7 = getelementptr inbounds i8, ptr %ele, i64 8, !dbg !289
  %_12.1 = load ptr, ptr %7, align 8, !dbg !289
  %8 = getelementptr inbounds i8, ptr %_12.1, i64 24, !dbg !289
  %9 = load ptr, ptr %8, align 8, !dbg !289
  %_9 = call i64 %9(ptr align 1 %_12.0), !dbg !289
  %10 = load i64, ptr %sum, align 8, !dbg !291
  %11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 %_9), !dbg !291
...

Comparing against the Julia LLVM codegen on Godbolt:

...
L36:                                              ; preds = %L65
    %6 = add i64 %value_phi1672, 1
    store ptr %12, ptr %gc_slot_addr_0, align 16
  store ptr %12, ptr %jlcallframe, align 8
  %7 = call nonnull ptr @ijl_apply_generic(ptr nonnull @"jl_global#1213.jit", ptr nonnull %jlcallframe, i32 1)

...

guard_exit:                                       ; preds = %L36, %L34
  %14 = phi ptr [ %7, %L36 ], [ %3, %L34 ]
  %value_phi1773 = phi i64 [ %15, %L36 ], [ 0, %L34 ]
  %value_phi1672 = phi i64 [ %6, %L36 ], [ 2, %L34 ]
   %.unbox = load i64, ptr %14, align 8
   %15 = add i64 %.unbox, %value_phi1773
    %16 = add i64 %value_phi1672, -1
...

While there’s no clear culprit for the performance difference, intuitively it makes sense that @ijl_apply_generic(...) might be slower than a vtable lookup + call.

1 Like

There are two layers of caching here:

  1. Julia caches the method-table lookup, cf. julia/src/gf.c at 2cc296c8fd3cba6b7caa872d5fb13c947b190780 · JuliaLang/julia · GitHub
  2. The CPU guesses the branch target

These are independent, but since branch-predictor misses don’t cost more than ~10ns, the big difference must come from Julia’s cache.

With respect to randomization, you’re probably seeing that Julia’s cache is 4-way associative, i.e. there are 4 usable cache slots for every callsite (as in: the instruction pointer of the jl_apply_generic call), and on a cache miss a random entry is evicted.
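To build intuition, here is a toy model (purely illustrative, not Julia’s actual gf.c data structures): a 4-slot cache with random eviction, whose hit rate falls off as one call site cycles through more distinct signatures.

# Toy model only: nsigs distinct call signatures hitting one call site
# that has `slots` cache entries and evicts a random entry on each miss.
function toy_hit_rate(nsigs; slots = 4, ncalls = 1_000_000)
    cache = zeros(Int, slots)           # 0 marks an empty slot
    hits = 0
    for _ in 1:ncalls
        sig = rand(1:nsigs)
        if sig in cache
            hits += 1
        else
            cache[rand(1:slots)] = sig  # random eviction on a miss
        end
    end
    return hits / ncalls
end

# toy_hit_rate(4) ends up near 1.0; toy_hit_rate(16) or toy_hit_rate(1176) drops steeply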

FWIW, I think there is a lot of optimization potential in that code!

2 Likes

That looks like debug mode’s overflow checks. Is your benchmark using debug mode? If not, release mode’s IR might show what’s going on.

Yeah, an AOT compiler knowing the return type of a known callable and the finite number of types of the 1 varying argument helps a lot. Multiple dispatch can vary the type of the callable and the other arguments too (like in my benchmark), and that really can’t be handled like single dispatch of a known callable. Some thoughts on a JIT-compiled interactive language’s options:

  • Sum types can specify a finite number of types for 1 argument.
  • If the callable is known at compile time, being able to assert a return type (justified for hash, as it’s documented to return UInt) seems like it could easily avoid allocations, even with less efficient multiple dispatch over unknown argument types.
  • If the callable is unknown but the argument types are, FunctionWrappers (with better support, hopefully) can cache the dispatch and convert to a specific return type; a rough sketch follows below.
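A rough sketch of that last option, assuming the FunctionWrappers.jl package is installed (the wrapper pins the argument and return types, so the lookup is done once when wrapping rather than at every call):

julia> using FunctionWrappers: FunctionWrapper

julia> const IntBinOp = FunctionWrapper{Int, Tuple{Int, Int}};

julia> wrapped = IntBinOp[IntBinOp(q), IntBinOp(w)];   # q, w from the @eval block earlier

julia> wrapped[rand(1:2)](1, 2);   # comes back as a concretely typed Int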

The timing varies across a lot more than just 1-4 call signatures (pretty much the entire 1 to 1176 methods), but the gf.c code is going way over my head so…

The Python benchmark’s timing also varies by >>10ns, and AFAIK all that happens at a method dispatch is

  • check the type of the instance to get its class
  • look up the method name in the class’s dictionary
  • instantiate a bound method that forwards to the found function
  • call the bound method

so that makes me think it’s low-level caching. Could it be the compiled methods being cached? The methods are pretty small, maybe the CPU cache can hold them?