Speed of multiple dispatch

Hello,
In the following example, multiple dispatch is more than 10 times slower than typeof comparisons.

Is this expected, or is there some issue in my code?

Output:
Size=20000

  • If typeof
    1006900000
    1.576137 seconds (5.35 k allocations: 403.821 KiB, 0.83% compilation time)
  • Multiple dispatch
    1006900000
    21.467297 seconds (400.01 M allocations: 5.961 GiB, 1.67% gc time, 0.17% compilation time)
  • If typeof
    1006900000
    1.537739 seconds (8 allocations: 240 bytes)
  • Multiple dispatch
    1006900000
    21.652412 seconds (400.00 M allocations: 5.961 GiB, 1.36% gc time, 0.00% compilation time)

The code:

struct S1 end
struct S2 end

do_dispatch(left::S1, right::S1) = 1
do_dispatch(left::S1, right::S2) = 2
do_dispatch(left::S2, right::S1) = 3
do_dispatch(left::S2, right::S2) = 4

function do_typeof(left, right)
    if typeof(left) === S1
        if typeof(right) === S1
            return 1
        elseif typeof(right) === S2
            return 2
        end
    elseif typeof(left) === S2
        if typeof(right) === S1
            return 3
        elseif typeof(right) === S2
            return 4
        end
    end
end

function test(size)
    println("Size=", size)
    items = rand([S1(), S2()], size)

    println("- If typeof")
    @time println(sum(do_typeof(x, y) for x in items, y in items))

    println("- Multiple dispatch")
    @time println(sum(do_dispatch(x, y) for x in items, y in items))

    println("- If typeof")
    @time println(sum(do_typeof(x, y) for x in items, y in items))

    println("- Multiple dispatch")
    @time println(sum(do_dispatch(x, y) for x in items, y in items))
end

test(20000)
1 Like

You didn’t include the code for how you did the benchmarking? The issue seems to lie there; the first hint is the allocations, which shouldn’t be there. This is how it should be done:

using BenchmarkTools
left = S1()
right = S2()

julia> @btime do_dispatch($left, $right)
  1.015 ns (0 allocations: 0 bytes)
2

julia> @btime do_typeof($left, $right)
  1.015 ns (0 allocations: 0 bytes)
2

The speed is the same because the compiler generates identical code for both:

julia> @code_llvm do_dispatch(left, right)
;  @ REPL[4]:1 within `do_dispatch`
define i64 @julia_do_dispatch_332() #0 {
top:
  ret i64 2
}

julia> @code_llvm do_typeof(left, right)
;  @ REPL[7]:1 within `do_typeof`
define i64 @julia_do_typeof_341() #0 {
top:
  ret i64 2
}

They did include the code; you probably didn’t scroll down in their code block.

The performance difference seems reasonable; unfortunately, multiple dispatch is just not that fast.

1 Like

It looks like they made the classic mistake of including compile time in the benchmarks.

There also seems to be something off with the allocations.

Note that items is of type Vector{Any}. With a narrower element type, both paths are faster, and the dispatch code is equally fast. (This holds only for sufficiently small unions.)

julia> rand([S1(), S2()], 3)
3-element Vector{Any}:
 S2()
 S1()
 S2()

julia> rand(Union{S1,S2}[S1(), S2()], 3)
3-element Vector{Union{S1, S2}}:
 S1()
 S2()
 S2()

I believe the reason dynamic dispatch is slower is that the code compiled for Vector{Any} cannot exploit the knowledge that do_dispatch has exactly 4 methods. This may change (e.g. you could define a struct S3), and I don’t think the compiled code will be discarded.

Whereas the code for Vector{Union{S1, S2}} is going to be essentially the same branches you wrote, probably inlining do_dispatch. This code can never encounter some new S3, and I believe it will be thrown away if you change the definitions of do_dispatch it uses.
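The element-type difference is easy to check directly. Below is a small sketch reusing the definitions from the original post; `any_items`, `union_items`, and `total` are names invented here for illustration:

```julia
struct S1 end
struct S2 end

do_dispatch(::S1, ::S1) = 1
do_dispatch(::S1, ::S2) = 2
do_dispatch(::S2, ::S1) = 3
do_dispatch(::S2, ::S2) = 4

# [S1(), S2()] promotes to eltype Any (typejoin(S1, S2) == Any), so rand
# produces a Vector{Any}; annotating the literal preserves the small Union.
any_items   = rand([S1(), S2()], 100)
union_items = Union{S1,S2}[x for x in any_items]  # same elements, narrower eltype

println(eltype(any_items))    # Any
println(eltype(union_items))  # Union{S1, S2}

# Both give the same answer; only how the calls are resolved differs.
total(items) = sum(do_dispatch(x, y) for x in items, y in items)
println(total(any_items) == total(union_items))   # true
```

On the `Union{S1,S2}` vector the compiler can enumerate the two cases (union splitting) and turn each call into a branch, which is why it matches the hand-written `typeof` version.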

8 Likes

@mcabbott beat me to it. His answer is correct.

julia> function test(size)
           println("Size=", size)
           items = rand(Union{S1,S2}[S1(), S2()], size)

           println("- If typeof")
           @time println(sum(do_typeof(x, y) for x in items, y in items))

           println("- Multiple dispatch")
           @time println(sum(do_dispatch(x, y) for x in items, y in items))

           println("- If typeof")
           @time println(sum(do_typeof(x, y) for x in items, y in items))

           println("- Multiple dispatch")
           @time println(sum(do_dispatch(x, y) for x in items, y in items))
       end
test (generic function with 1 method)

julia> test(1000)
Size=1000
- If typeof
2491000
  0.002022 seconds (8 allocations: 232 bytes)
- Multiple dispatch
2491000
  0.001944 seconds (8 allocations: 232 bytes)
- If typeof
2491000
  0.002671 seconds (8 allocations: 232 bytes)
- Multiple dispatch
2491000
  0.001947 seconds (8 allocations: 232 bytes)

julia> test(20000)
Size=20000
- If typeof
994060000
  1.531119 seconds (9 allocations: 544 bytes)
- Multiple dispatch
994060000
  1.514052 seconds (8 allocations: 240 bytes)
- If typeof
994060000
  1.585021 seconds (8 allocations: 240 bytes)
- Multiple dispatch
994060000
  1.542702 seconds (8 allocations: 240 bytes)

Make sure to do a warm up run on test or use BenchmarkTools.

1 Like

I just want to say that @hlbnet’s original benchmarks are perfectly sound, reproducible, included both a warmup and second evaluation, and documented the proportion of time spent on compilation (which happened to be negligible).

Thank you @mcabbott for providing an accurate answer, and apologies to @hlbnet for the community’s misplaced criticism of your benchmarks. We get a lot of new users who report strange benchmark results due to poor benchmarking and it’s an understandable though unfortunate mistake to lump you in with those folks.

15 Likes

My mistake, I didn’t hover on the code so I didn’t see the scroll bar and the test function…

Also, I perhaps missed the point a little bit if the idea was to test dynamic dispatch, where the compiler doesn’t know in advance which function will be called. As mcabbott noticed, the dispatch version slows down and allocates a lot if the vector type doesn’t contain any information on the element types. However, the version using conditions is not affected. It is not at all obvious why this should be the case: why can’t the dispatch system do something equivalent to the typeof tests to avoid the allocations and the dramatic slowdown?

So although a simple fix exists (using a narrower element type for the vector), the question remains: why does the dispatch version do so poorly compared to the reference in this particular case?
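One way to see where the cost goes, without BenchmarkTools, is Base’s `@allocated`. A rough sketch, assuming the `S1`/`S2`/`do_dispatch` definitions from the original post (`total` is a helper invented here; exact byte counts vary by Julia version):

```julia
struct S1 end
struct S2 end

do_dispatch(::S1, ::S1) = 1
do_dispatch(::S1, ::S2) = 2
do_dispatch(::S2, ::S1) = 3
do_dispatch(::S2, ::S2) = 4

function total(items)
    s = 0
    for x in items, y in items
        s += do_dispatch(x, y)
    end
    return s
end

any_items   = rand([S1(), S2()], 200)             # eltype Any
union_items = Union{S1,S2}[x for x in any_items]  # same elements, small Union

total(any_items); total(union_items)              # warm up: compile both paths

# On the Any path every call is resolved at run time and its Int result is
# boxed, so allocations scale with the number of calls; the union-split path
# allocates essentially nothing.
println(@allocated total(any_items))
println(@allocated total(union_items))
```

This lines up with the original numbers: roughly one small allocation per dynamic call (400 M allocations for 400 M calls), all of which disappear with the narrower element type.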

1 Like

(EDIT: as shown below, this seems to be a false result because the compiler optimized away the loop when I didn’t print out the result.)

There are also other factors affecting the performance here. In my experience, generator expressions and closures don’t always perform as you would hope, and the allocations suggested something might be going on. So I ran a test using an old-fashioned for loop and got another order-of-magnitude reduction in run time:

function test(size)
    println("Size=", size)
    items = rand(Union{S1,S2}[S1(), S2()], size)  # Note the narrower type

    println("- If typeof")
    @time println(sum(do_typeof(x, y) for x in items, y in items))

    println("- Multiple dispatch")
    @time println(sum(do_dispatch(x, y) for x in items, y in items))
end

function test_with_for(size)
    println("Size=", size)
    items = rand(Union{S1,S2}[S1(), S2()], size)

    println("- If typeof")
    s = 0
    @time for x in items, y in items
        s += do_typeof(x, y)
    end
    
    println("- Multiple dispatch")
    s = 0
    @time for x in items, y in items
        s += do_dispatch(x, y)
    end
end
    
# Compile
test(20000)
test_with_for(20000)

julia> test(20000)
Size=20000
- If typeof
997900000
  1.458255 seconds (9 allocations: 544 bytes)
- Multiple dispatch
997900000
  1.488013 seconds (8 allocations: 240 bytes)

julia> test_with_for(20000)
Size=20000
- If typeof
  0.212348 seconds
- Multiple dispatch
  0.211795 seconds
1 Like

I have replaced the following line:
items = rand([S1(), S2()], size)

By this one:
items = rand(Union{S1,S2}[S1(), S2()], size)

The test with multiple dispatch is then dramatically sped up, achieving the same performance as the typeof test.

That is indeed very interesting and informative. Note that I was doing this experiment just because I am trying to learn Julia. Rest assured, I will remember this lesson for a long time!

Thank you @mcabbott for this.

3 Likes

I have also tested @tverho idea to not use generator expressions.
Here is the new output:
Size=20000

  • If typeof (with generator)
    1.329311 seconds
    sum=998440000
  • If typeof (classical loop)
    1.155876 seconds
    sum=998440000
  • Multiple dispatch (with generator)
    1.404892 seconds
    sum=998440000
  • Multiple dispatch (classical loop)
    1.112607 seconds
    sum=998440000
  • If typeof (with generator)
    1.360723 seconds
    sum=998440000
  • If typeof (classical loop)
    1.111058 seconds
    sum=998440000
  • Multiple dispatch (with generator)
    1.427805 seconds
    sum=998440000
  • Multiple dispatch (classical loop)
    1.146041 seconds
    sum=998440000

Tests with classical loops are faster, but not by as much as in @tverho’s example.
Still trying to understand why…
Here is my new code at this moment:

struct S1 end
struct S2 end

do_dispatch(left::S1, right::S1) = 1
do_dispatch(left::S1, right::S2) = 2
do_dispatch(left::S2, right::S1) = 3
do_dispatch(left::S2, right::S2) = 4

function do_typeof(left, right)
    if typeof(left) === S1
        if typeof(right) === S1
            return 1
        elseif typeof(right) === S2
            return 2
        end
    elseif typeof(left) === S2
        if typeof(right) === S1
            return 3
        elseif typeof(right) === S2
            return 4
        end
    end
end

function test(size)
    println("Size=", size)
    items = rand(Union{S1,S2}[S1(), S2()], size)

    println("- If typeof (with generator)")
    @time s = sum(do_typeof(x, y) for x in items, y in items)
    println("  sum=", s)

    println("- If typeof (classical loop)")
    s = 0
    @time for x in items, y in items
        s += do_typeof(x, y)
    end
    println("  sum=", s)

    println("- Multiple dispatch (with generator)")
    @time s = sum(do_dispatch(x, y) for x in items, y in items)
    println("  sum=", s)

    println("- Multiple dispatch (classical loop)")
    s = 0
    @time for x in items, y in items
        s += do_dispatch(x, y)
    end
    println("  sum=", s)

    println("- If typeof (with generator)")
    @time s = sum(do_typeof(x, y) for x in items, y in items)
    println("  sum=", s)

    println("- If typeof (classical loop)")
    s = 0
    @time for x in items, y in items
        s += do_typeof(x, y)
    end
    println("  sum=", s)

    println("- Multiple dispatch (with generator)")
    @time s = sum(do_dispatch(x, y) for x in items, y in items)
    println("  sum=", s)

    println("- Multiple dispatch (classical loop)")
    s = 0
    @time for x in items, y in items
        s += do_dispatch(x, y)
    end
    println("  sum=", s)
end

test(20000)

Oops, when I omitted the print statement the compiler seemed to optimize away the whole loop (or something). When I added the print statement I got the same result as you. So never mind :)
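For what it’s worth, a common way to guard a hand-rolled benchmark loop against this is to put it in a function and return the accumulator, so the work stays observable. A minimal sketch with the same `S1`/`S2` setup (`total_dispatch` is a name invented here):

```julia
struct S1 end
struct S2 end

do_dispatch(::S1, ::S1) = 1
do_dispatch(::S1, ::S2) = 2
do_dispatch(::S2, ::S1) = 3
do_dispatch(::S2, ::S2) = 4

function total_dispatch(items)
    s = 0
    for x in items, y in items
        s += do_dispatch(x, y)
    end
    return s  # returning s prevents the loop from being optimized away
end

items = Union{S1,S2}[S1(), S2()]
println(total_dispatch(items))  # 1 + 2 + 3 + 4 = 10
```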

typeof(x) gives us a pointer to the type. This is just a memory load, because Julia objects store a pointer to the type in their header.

Nice! This means that type comparisons are fast (pointer compare).
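A minimal illustration of that (nothing Julia-version-specific is assumed here):

```julia
struct S1 end

x = S1()

# typeof returns the type object itself; `===` on it is an identity
# comparison, which is effectively a pointer compare.
println(typeof(x) === S1)  # true
println(S1 isa DataType)   # true: types are ordinary objects with identity
```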

In order to figure out where a call goes, we need to collect all the arguments to the method, and then look that up by a complicated tree search.

No, just kidding. There is a small hashmap cache in front of the tree search, so we just need to look into that hashmap. In trivial benchmarks (long loop, few dynamic dispatches) the cache is always hot, i.e. we immediately find the right entry. (So your benchmark is quite optimistic in that sense.)

Alas, the entire code for this is written in C and is part of libjulia. You can look at it here.

Now ideally someone would sit down and intrinsify the fastpath of jl_apply_generic, such that llvm can optimize through it (and e.g. use the fact that world-age and arity and address are compile-time known constants).

It appears that nobody with the right skillset really cares about shaving off a couple of nanos from dynamic multi-dispatch, so it doesn’t get faster (same as e.g. push! to a vector).

edit: You might also ask: Shouldn’t the JIT see the right dispatch target on the first call? Then it should be able to emit code for a fastpath that checks whether the type is as expected, and otherwise calls into the slow thing.

Alas, nope. Julia is not really a JIT. We don’t support tiered compilation and profile-guided optimization / tracing with guards that deoptimize on surprises. These would be awesome features, but I don’t see it happening (LLVM purports to support something in that direction, but afaiu it’s pretty half-hearted).

2 Likes

Thanks for the answer!

Regarding push! to a vector, I saw news recently that efforts to rewrite Vector based on a low-level Memory type are underway, see This month in Julia world - 2023-10

2 Likes

The situation is not quite so dire as described here. [WIP] dynamic dispatch optimization: avoid boxing stack-allocated inputs by NHDaly · Pull Request #50136 · JuliaLang/julia · GitHub found that it is possible to eliminate almost all of the allocations incurred by dynamic dispatch by modifying the generic function calling convention. Unfortunately, implementing this properly would appear to require deep changes to the Julia runtime, and nobody else seems to be both willing and able to make said changes.

4 Likes