Hello,
In the following example, multiple dispatch is more than 10 times slower than typeof comparisons.
Is this expected, or is there some issue in my code?
Output:
Size=20000
- If typeof
1006900000
1.576137 seconds (5.35 k allocations: 403.821 KiB, 0.83% compilation time)
- Multiple dispatch
1006900000
21.467297 seconds (400.01 M allocations: 5.961 GiB, 1.67% gc time, 0.17% compilation time)
- If typeof
1006900000
1.537739 seconds (8 allocations: 240 bytes)
- Multiple dispatch
1006900000
21.652412 seconds (400.00 M allocations: 5.961 GiB, 1.36% gc time, 0.00% compilation time)
The code:
struct S1 end
struct S2 end
do_dispatch(left::S1, right::S1) = 1
do_dispatch(left::S1, right::S2) = 2
do_dispatch(left::S2, right::S1) = 3
do_dispatch(left::S2, right::S2) = 4
function do_typeof(left, right)
if typeof(left) === S1
if typeof(right) === S1
return 1
elseif typeof(right) === S2
return 2
end
elseif typeof(left) === S2
if typeof(right) === S1
return 3
elseif typeof(right) === S2
return 4
end
end
end
function test(size)
println("Size=", size)
items = rand([S1(), S2()], size)
println("- If typeof")
@time println(sum(do_typeof(x, y) for x in items, y in items))
println("- Multiple dispatch")
@time println(sum(do_dispatch(x, y) for x in items, y in items))
println("- If typeof")
@time println(sum(do_typeof(x, y) for x in items, y in items))
println("- Multiple dispatch")
@time println(sum(do_dispatch(x, y) for x in items, y in items))
end
test(20000)
You didn’t include the code showing how you did the benchmarking. The issue seems to lie there; the first hint is the allocations, which shouldn’t be there. This is how it should be done:
using BenchmarkTools
left = S1()
right = S2()
julia> @btime do_dispatch($left, $right)
1.015 ns (0 allocations: 0 bytes)
2
julia> @btime do_typeof($left, $right)
1.015 ns (0 allocations: 0 bytes)
2
The speed is the same because the compiler generates identical code for both:
julia> @code_llvm do_dispatch(left, right)
; @ REPL[4]:1 within `do_dispatch`
define i64 @julia_do_dispatch_332() #0 {
top:
ret i64 2
}
julia> @code_llvm do_typeof(left, right)
; @ REPL[7]:1 within `do_typeof`
define i64 @julia_do_typeof_341() #0 {
top:
ret i64 2
}
Note that items is of type Vector{Any}. With a narrower element type, both paths are faster, and the dispatch code is equally fast. (This holds only for sufficiently small unions.)
I believe the reason dynamic dispatch is slower is that the code compiled for Vector{Any} cannot exploit the knowledge that do_dispatch has exactly 4 methods. This may change (e.g. you could define a struct S3), and I don’t think the compiled code will be discarded when that happens.
Whereas the code for Vector{Union{S1, S2}} is going to be essentially the same branches you wrote, probably inlining do_dispatch. This code can never encounter some new S3, and I believe it will be thrown away if you change the definitions of do_dispatch it uses.
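As a quick way to see that element-type difference (a minimal sketch; the variable names v_any and v_union are mine, not from the thread):

```julia
struct S1 end
struct S2 end

# A plain literal [S1(), S2()] promotes its element type all the way to Any,
# so rand over it returns a Vector{Any}:
v_any = rand([S1(), S2()], 10)
println(typeof(v_any))    # Vector{Any}

# Annotating the literal keeps the small union, which the compiler can
# union-split into essentially the same branches as the typeof version:
v_union = rand(Union{S1,S2}[S1(), S2()], 10)
println(typeof(v_union))  # Vector{Union{S1, S2}}
```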
julia> function test(size)
println("Size=", size)
items = rand(Union{S1,S2}[S1(), S2()], size)
println("- If typeof")
@time println(sum(do_typeof(x, y) for x in items, y in items))
println("- Multiple dispatch")
@time println(sum(do_dispatch(x, y) for x in items, y in items))
println("- If typeof")
@time println(sum(do_typeof(x, y) for x in items, y in items))
println("- Multiple dispatch")
@time println(sum(do_dispatch(x, y) for x in items, y in items))
end
test (generic function with 1 method)
julia> test(1000)
Size=1000
- If typeof
2491000
0.002022 seconds (8 allocations: 232 bytes)
- Multiple dispatch
2491000
0.001944 seconds (8 allocations: 232 bytes)
- If typeof
2491000
0.002671 seconds (8 allocations: 232 bytes)
- Multiple dispatch
2491000
0.001947 seconds (8 allocations: 232 bytes)
julia> test(20000)
Size=20000
- If typeof
994060000
1.531119 seconds (9 allocations: 544 bytes)
- Multiple dispatch
994060000
1.514052 seconds (8 allocations: 240 bytes)
- If typeof
994060000
1.585021 seconds (8 allocations: 240 bytes)
- Multiple dispatch
994060000
1.542702 seconds (8 allocations: 240 bytes)
Make sure to do a warm up run on test or use BenchmarkTools.
I just want to say that @hlbnet’s original benchmarks are perfectly sound, reproducible, included both a warmup and second evaluation, and documented the proportion of time spent on compilation (which happened to be negligible).
Thank you @mcabbott for providing an accurate answer, and apologies to @hlbnet for the community’s misplaced criticism of your benchmarks. We get a lot of new users who report strange benchmark results due to poor benchmarking and it’s an understandable though unfortunate mistake to lump you in with those folks.
My mistake, I didn’t hover over the code, so I didn’t see the scroll bar and the test function…
Also, I perhaps missed the point a little bit, if the idea was to test dynamic dispatch where the compiler doesn’t know in advance which function will be called. As mcabbott noticed, the dispatch version slows down and allocates a lot if the vector type doesn’t contain any information on the element types. However, the version using conditions is not affected. It is not at all obvious why this should be the case. Why can’t the dispatch system do something equivalent to the typeof tests to avoid the allocations and the dramatic slowdown?
So although a simple fix exists (using a narrower element type for the vector), the question remains why the dispatch version does so poorly compared to the reference in this particular case.
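For what it’s worth, the compiler’s union-splitting can be imitated by hand, even over a Vector{Any}: after an isa test, the arguments have concrete types inside each branch, so every do_dispatch call below is resolved statically. (total_split is a hypothetical helper of mine, not code from the thread; it assumes only S1/S2 values ever appear.)

```julia
struct S1 end
struct S2 end
do_dispatch(::S1, ::S1) = 1
do_dispatch(::S1, ::S2) = 2
do_dispatch(::S2, ::S1) = 3
do_dispatch(::S2, ::S2) = 4

# Manual "union splitting": inside each branch the compiler knows the
# concrete types of x and y, so the calls no longer go through the
# dynamic-dispatch machinery.
function total_split(items)
    s = 0
    for x in items, y in items
        if x isa S1 && y isa S1
            s += do_dispatch(x, y)
        elseif x isa S1 && y isa S2
            s += do_dispatch(x, y)
        elseif x isa S2 && y isa S1
            s += do_dispatch(x, y)
        elseif x isa S2 && y isa S2
            s += do_dispatch(x, y)
        end
    end
    return s
end

items = Any[S1(), S2(), S1()]
println(total_split(items))   # 18
```

This is essentially what the compiler does automatically for a Vector{Union{S1,S2}}, which is why the narrower element type fixes the benchmark.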
(EDIT: as shown below, this seems to be a false result, because the compiler optimized away the loop when I didn’t print out the result)
There are also other factors affecting the performance here… In my experience, generator expressions and closures don’t always work as you would hope performance-wise, and the allocations suggested something might be going on. So I ran a test using an old-fashioned for loop and got another order of magnitude reduction in run time:
function test(size)
println("Size=", size)
items = rand(Union{S1,S2}[S1(), S2()], size) # Note the narrower type
println("- If typeof")
@time println(sum(do_typeof(x, y) for x in items, y in items))
println("- Multiple dispatch")
@time println(sum(do_dispatch(x, y) for x in items, y in items))
end
function test_with_for(size)
println("Size=", size)
items = rand(Union{S1,S2}[S1(), S2()], size)
println("- If typeof")
s = 0
@time for x in items, y in items
s += do_typeof(x, y)
end
println("- Multiple dispatch")
s = 0
@time for x in items, y in items
s += do_dispatch(x, y)
end
end
# Compile
test(20000)
test_with_for(20000)
julia> test(20000)
Size=20000
- If typeof
997900000
1.458255 seconds (9 allocations: 544 bytes)
- Multiple dispatch
997900000
1.488013 seconds (8 allocations: 240 bytes)
julia> test_with_for(20000)
Size=20000
- If typeof
0.212348 seconds
- Multiple dispatch
0.211795 seconds
I have replaced the following line:
items = rand([S1(), S2()], size)
By this one:
items = rand(Union{S1,S2}[S1(), S2()], size)
The test with multiple dispatch is then dramatically sped up, achieving the same performance as the typeof test.
That is indeed very interesting and informative. Note that I was doing this experiment just because I am trying to learn Julia. Be sure I will remember this lesson for a long time!
Tests with classical loops are faster, but not as much as in @tverho’s example.
Still trying to understand why …
Here is my new code at this moment:
struct S1 end
struct S2 end
do_dispatch(left::S1, right::S1) = 1
do_dispatch(left::S1, right::S2) = 2
do_dispatch(left::S2, right::S1) = 3
do_dispatch(left::S2, right::S2) = 4
function do_typeof(left, right)
if typeof(left) === S1
if typeof(right) === S1
return 1
elseif typeof(right) === S2
return 2
end
elseif typeof(left) === S2
if typeof(right) === S1
return 3
elseif typeof(right) === S2
return 4
end
end
end
function test(size)
println("Size=", size)
items = rand(Union{S1,S2}[S1(), S2()], size)
println("- If typeof (with generator)")
@time s = sum(do_typeof(x, y) for x in items, y in items)
println(" sum=", s)
println("- If typeof (classical loop)")
s = 0
@time for x in items, y in items
s += do_typeof(x, y)
end
println(" sum=", s)
println("- Multiple dispatch (with generator)")
@time s = sum(do_dispatch(x, y) for x in items, y in items)
println(" sum=", s)
println("- Multiple dispatch (classical loop)")
s = 0
@time for x in items, y in items
s += do_dispatch(x, y)
end
println(" sum=", s)
println("- If typeof (with generator)")
@time s = sum(do_typeof(x, y) for x in items, y in items)
println(" sum=", s)
println("- If typeof (classical loop)")
s = 0
@time for x in items, y in items
s += do_typeof(x, y)
end
println(" sum=", s)
println("- Multiple dispatch (with generator)")
@time s = sum(do_dispatch(x, y) for x in items, y in items)
println(" sum=", s)
println("- Multiple dispatch (classical loop)")
s = 0
@time for x in items, y in items
s += do_dispatch(x, y)
end
println(" sum=", s)
end
test(20000)
Oops, when I omitted the print statement, the compiler seemed to optimize away the whole loop (or something). When I added the print statement, I got the same result as you. So never mind :).
typeof(x) gives us a pointer to the type. This is just a memory load, because Julia objects store a pointer to their type in their header.
Nice! This means that type comparisons are fast (pointer compare).
In order to figure out where a call goes, we need to collect all the arguments to the method, and then look that up by a complicated tree search.
No, just kidding. There is a small hashmap-cache in front of the tree search. So we just need to look into that hashmap. In trivial benchmarks (long loop, few dynamic dispatches) the cache is always hot, i.e. we immediately find the right entry. (So your benchmark is quite optimistic in that sense)
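As a toy model of that fast path (purely illustrative; the real cache lives in C inside jl_apply_generic, and slow_lookup here merely stands in for the method-table search):

```julia
struct S1 end
struct S2 end

# Stand-in for the full method-table search ("the complicated tree search"):
# given the argument types, return the matching method body.
function slow_lookup(key)
    key == (S1, S1) && return (l, r) -> 1
    key == (S1, S2) && return (l, r) -> 2
    key == (S2, S1) && return (l, r) -> 3
    return (l, r) -> 4
end

# The small hashmap cache sitting in front of it, keyed by argument types.
const TOY_CACHE = Dict{Tuple{DataType,DataType},Function}()

function toy_dispatch(left, right)
    key = (typeof(left), typeof(right))
    f = get!(TOY_CACHE, key) do
        slow_lookup(key)    # cache miss: do the expensive search once
    end
    return f(left, right)   # later calls with these types hit the cache
end

println(toy_dispatch(S1(), S2()))  # 2 (cache miss, then stored)
println(toy_dispatch(S1(), S2()))  # 2 (served from the cache)
```

In a tight benchmark loop over the same few types, the cache is always hot, which is why even dynamic dispatch stays in the nanosecond range per call.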
Alas, the entire code for this is written in C and is part of libjulia. You can look at it here.
Now ideally someone would sit down and intrinsify the fast path of jl_apply_generic, such that LLVM can optimize through it (and e.g. use the fact that world age, arity, and address are compile-time known constants).
It appears that nobody with the right skill set really cares about shaving a couple of nanoseconds off dynamic multiple dispatch, so it doesn’t get faster (same as e.g. push! to a vector).
edit: You might also ask: Shouldn’t the JIT see the right dispatch target on the first call? Then it should be able to emit code for a fastpath that checks whether the type is as expected, and otherwise calls into the slow thing.
Alas, nope. Julia is not really a JIT. We don’t support tiered compilation or profile-guided optimization / tracing with guards that deoptimize on surprises. These would be awesome features, but I don’t see it happening (LLVM purports to support something in that direction, but AFAIU it’s pretty half-hearted).
Regarding push! to a vector, I saw news recently that efforts to rewrite Vector based on a low-level Memory type are underway, see This month in Julia world - 2023-10