Curious about the internals of dynamic dispatch

My mental model is that Julia handles dynamic dispatch by compiling code that looks like:

for ele in vector_of_abstract_type
    do_something(ele)
end

into something that looks like this:

for ele in vector_of_abstract_type
    type_of_ele = ele.type # inspect some type information stored with the object
    if type_of_ele in do_something_method_table
        do_something_method_specialization = do_something_method_table[type_of_ele] # lookup method specialization
        do_something_method_specialization(ele) # jump to the already compiled machine code
    else
        do_something_function_specialization = determine_method(do_something, type_of_ele) # determine which method of the do_something_function to use
        do_something_method_specialization = compile(do_something_function_specialization, type_of_ele) # compile a specialization of the method based on type_of_ele
        do_something_method_table[type_of_ele] = do_something_method_specialization # store a reference to the final compiled machine code
        do_something_method_specialization(ele) # jump to the machine code
    end
end

Basically, according to this model, the cost of dynamic dispatch is a single hash table lookup if the specialization has already been compiled. Otherwise it also requires determining which method to use and then compiling that method for our specific type.

I differentiate method specializations (raw machine code) from function specializations (which method of a function is called).

1.) Is this what actually happens? Does the hash table map the types of the inputs to the method (function specialization) or to the final compiled machine code location (method specialization)? In my understanding above it was the latter.

2.) If do_something is roughly 10x the cost of a dictionary lookup then can we ignore the cost of dynamic dispatch (given that the type instability doesn’t propagate to other parts of the code)? Is dictionary lookup a good baseline cost comparison?

3.) If we know that the return type of do_something is the same, no matter what the type of ele is, can we assume that the compiler also knows? Can we use a return value in this case in a type stable way? Or should we put a type assertion? If we put a type assertion will the compiler be able to continue in a type stable way with that return value?

4.) If do_something adds a new method to do_something, will the hash table be emptied out, causing the else branch to execute every time and leading to very slow code?

5.) Kind of unrelated, but does the compiler compile as deep as it can until it hits a value with a type it cannot determine at compile time, insert a callback to itself there, unwind and execute? Or does it just compile the local function and at any function call point insert a callback to itself, followed by an overwriting of that callback once it compiles the callee?

6.) Does Julia store some type identifying information with every piece of data?
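To make question 3.) concrete, here is the kind of pattern I have in mind (the types and functions below are hypothetical, just for illustration): a return-type assertion on a dynamically dispatched call, which I'd hope lets inference continue type-stably afterwards.

```julia
abstract type Shape end
struct Square <: Shape; s::Float64; end
struct Circle <: Shape; r::Float64; end

# Both methods happen to return Float64 regardless of the concrete type.
area(x::Square) = x.s^2
area(x::Circle) = 3.141592653589793 * x.r^2

function total_area(shapes::Vector{Shape})
    total = 0.0
    for sh in shapes
        # The call is dynamically dispatched, but the ::Float64
        # assertion tells the compiler the result type, so the
        # accumulation stays type-stable.
        total += area(sh)::Float64
    end
    total
end
```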


The dispatch rule itself is NP-hard and too complicated to list here; there are many heuristics involved. Rest assured, it tries to be fast.
Adding a method to the supertype of A might not throw away the cached specializations for subtypes of A, for example.

But the system is complicated and subject to change over time. Only the behavior is defined, and even that is non-trivial; the performance can be optimized over time.

Why is it NP-hard (perhaps formulating the exact problem may help me see that)? It seems O(1) to me if we have encountered the type combination before: just look up where the compiled code resides in a hash table keyed by the argument types and jump to it.

It's the first time that's hard.
Even the question of whether or not some particular combination of argument types can be successfully dispatched at all is NP-hard.

Cache lookup can be O(1), but that's cache lookup, not solving the dispatch problem itself.
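One way to see the cache effect in practice (a sketch using only Base timing; the exact numbers are machine-dependent and illustrative, not guaranteed):

```julia
h(x::Int) = x + 1
h(x::Float64) = 2x

# Hide the concrete type from the compiler so each call is
# dynamically dispatched at runtime.
hidden = Any[1]

t_first  = @elapsed h(hidden[1])  # method selection + compilation
t_cached = @elapsed h(hidden[1])  # hits the cached specialization

# t_first is typically orders of magnitude larger than t_cached.
```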


This might be an interesting read on this topic: https://dl.acm.org/doi/10.1145/3276483, "Julia subtyping: a rational reconstruction".


Jeff’s thesis is another very good high-level authoritative source. It’s old, but many of the general strategies are still accurate.


Ah yes, I see. So the else branch in the pseudocode above can be quite slow, not just from compilation but also from method selection (though heuristics make it pretty good).

But the if branch is just a dictionary lookup to the compiled code?

I ran some benchmarks comparing dynamic dispatch cost in Julia vs Rust.

In Julia I got the following results:

Dynamic Dispatch:
  4.677 ms (0 allocations: 0 bytes)
Static Dispatch
  159.291 μs (0 allocations: 0 bytes)

And in Rust I got the following:

Dynamic dispatch runtime: 1.62775ms
Static dispatch runtime: 156.541µs

Below is the code I ran:

Julia code:

using BenchmarkTools

abstract type A end

struct B <: A 
    b::Int64
end

struct C <: A 
    c::Int64
end

function get_val(b::B)
    b.b
end

function get_val(c::C)
    c.c
end

function dynamic_dispatch(v::Vector{A})
    sum = 0
    for a in v
        sum += get_val(a)::Int64
    end
    sum
end

function static_dispatch(v::Vector{B})
    sum = 0
    for a in v
        sum += get_val(a)::Int64
    end
    sum
end

function main()
    v1 = Vector{A}()
    v2 = Vector{B}()
    for _ in 1:1_000_000
        x = rand([1, 2])
        if x === 1
            push!(v1, B(1))
            push!(v2, B(1))
        else
            push!(v1, C(2))
            push!(v2, B(2))
        end
    end
    println("Dynamic Dispatch:")
    @btime dynamic_dispatch($v1)
    println("Static Dispatch")
    @btime static_dispatch($v2)
end

main()

Rust code:

trait A {
    fn get_val(&self) -> i64;
}

struct B {
    b: i64,
}

struct C {
    c: i64,
}

impl A for B {
    fn get_val(&self) -> i64 {
        self.b
    }
}

impl A for C {
    fn get_val(&self) -> i64 {
        self.c
    }
}

fn dynamic_dispatch(v: &Vec<Box<dyn A>>) -> i64 {
    let mut sum = 0;
    for ele in v {
        sum += ele.get_val();
    }
    sum
}

fn static_dispatch(v: &Vec<B>) -> i64 {
    let mut sum = 0;
    for ele in v {
        sum += ele.get_val();
    }
    sum
}

fn main() {
    let mut v1 = Vec::<Box<dyn A>>::new();
    let mut v2 = Vec::<B>::new();

    for _ in 0..1_000_000 {
        if rand::random() {
            v1.push(Box::new(B { b: 1 }));
            v2.push(B { b: 1 });
        } else {
            v1.push(Box::new(C { c: 2 }));
            v2.push(B { b: 2 });
        }
    }

    let now = std::time::Instant::now();
    let r1 = dynamic_dispatch(&v1);
    let elapsed: std::time::Duration = now.elapsed();
    println!("Dynamic dispatch runtime: {:?}", elapsed);
    let now = std::time::Instant::now();
    let r2 = static_dispatch(&v2);
    let elapsed: std::time::Duration = now.elapsed();
    println!("Static dispatch runtime: {:?}", elapsed);
    println!("{}, {}", r1, r2);
}

I ran the Rust code many times to get multiple sample points (release build). Basically static dispatch has the same performance in both, but Rust's dynamic dispatch is significantly faster. This matches my general feel for Julia's dynamic dispatch: it just feels a little slower than I expected relative to method lookups in the other "fast" languages like C++ and Rust. Is there a good reason for this? Is there some dynamic behavior allowed in Julia but not in those languages that makes Julia this much slower when it comes to dynamic dispatch?

Julia was not optimized for dynamic dispatch. Caching can be done, but caches can also be invalidated; it's a trade-off. I don't know the Julia internals, and they can always change.

But the problem here, I think, is indirection. Type-unstable values need to be boxed in Julia, and this indirection ruins performance.
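To illustrate the boxing point (a sketch with made-up names): if the set of concrete types is small and known, narrowing the element type from the abstract type to a small Union lets Julia store the values inline instead of behind pointers, and the compiler can often union-split the call so there is no dynamic dispatch at all.

```julia
abstract type AbstractVal end
struct IntVal   <: AbstractVal; x::Int64; end
struct OtherVal <: AbstractVal; x::Int64; end

getx(v::IntVal)   = v.x
getx(v::OtherVal) = v.x

# A Vector{AbstractVal} holds pointers to boxed elements, while a
# Vector{Union{IntVal,OtherVal}} stores the structs inline with a
# small type tag; calls like getx can then be union-split into
# static branches.
boxed  = AbstractVal[IntVal(1), OtherVal(2), IntVal(3)]
inline = Union{IntVal,OtherVal}[IntVal(1), OtherVal(2), IntVal(3)]

total(v) = sum(getx, v)
```

Benchmarking `total(boxed)` against `total(inline)` (e.g. with BenchmarkTools) should show the gap that the indirection costs.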

To be more complete, actually benchmarking the Rust code with Criterion.rs (a benchmarking framework) yields:

dynamic dispatch        time:   [983.10 µs 983.59 µs 984.03 µs]
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe

static dispatch         time:   [99.710 µs 99.997 µs 100.40 µs]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

My point with the benchmarks above is that there is conceptually the same level of indirection (as far as I understand) in both the Julia and the Rust code. Can the performance difference really be attributed to Rust heavily optimizing dynamic dispatch of the same concept?

Rust uses a virtual method table, and maybe it optimizes away the boxes. In Rust, dispatch can be a simple array (vtable) lookup.

In Julia, dynamic dispatch needs at least a dictionary lookup.

I don't know how the memory allocation of the boxes happens; maybe Rust can optimize that.

Hmm, I noticed that my types were the same size, so there was a concern that something was being optimized away there. I changed B to also contain a Float64/f64.

Now we have Julia:

Dynamic Dispatch:
  4.673 ms (0 allocations: 0 bytes)
Static Dispatch
  256.708 μs (0 allocations: 0 bytes)

and Rust:

dynamic dispatch        time:   [4.4012 ms 4.4029 ms 4.4048 ms]
                        change: [+361.64% +362.24% +362.81%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

static dispatch         time:   [260.19 µs 260.35 µs 260.53 µs]
                        change: [+172.74% +173.19% +173.81%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

Looks like they match up pretty closely now. The point about dictionary lookup vs array lookup is a good one. I suppose Julia is forced to be a little slower on that front to allow for extra flexibility.
