Hello,
In the following example, multiple dispatch is more than 10 times slower than typeof comparisons.
Is this expected, or is there some issue in my code?
Output:
Size=20000
- If typeof
1006900000
1.576137 seconds (5.35 k allocations: 403.821 KiB, 0.83% compilation time)
- Multiple dispatch
1006900000
21.467297 seconds (400.01 M allocations: 5.961 GiB, 1.67% gc time, 0.17% compilation time)
- If typeof
1006900000
1.537739 seconds (8 allocations: 240 bytes)
- Multiple dispatch
1006900000
21.652412 seconds (400.00 M allocations: 5.961 GiB, 1.36% gc time, 0.00% compilation time)
The code:
struct S1 end
struct S2 end
do_dispatch(left::S1, right::S1) = 1
do_dispatch(left::S1, right::S2) = 2
do_dispatch(left::S2, right::S1) = 3
do_dispatch(left::S2, right::S2) = 4
function do_typeof(left, right)
if typeof(left) === S1
if typeof(right) === S1
return 1
elseif typeof(right) === S2
return 2
end
elseif typeof(left) === S2
if typeof(right) === S1
return 3
elseif typeof(right) === S2
return 4
end
end
end
function test(size)
println("Size=", size)
items = rand([S1(), S2()], size)
println("- If typeof")
@time println(sum(do_typeof(x, y) for x in items, y in items))
println("- Multiple dispatch")
@time println(sum(do_dispatch(x, y) for x in items, y in items))
println("- If typeof")
@time println(sum(do_typeof(x, y) for x in items, y in items))
println("- Multiple dispatch")
@time println(sum(do_dispatch(x, y) for x in items, y in items))
end
test(20000)
You didn’t include the code showing how you did the benchmarking. The issue seems to lie there; the first hint is the allocations, which shouldn’t be there. This is how it should be done:
using BenchmarkTools
left = S1()
right = S2()
julia> @btime do_dispatch($left, $right)
1.015 ns (0 allocations: 0 bytes)
2
julia> @btime do_typeof($left, $right)
1.015 ns (0 allocations: 0 bytes)
2
The speed is the same because the compiler generates identical code for both:
julia> @code_llvm do_dispatch(left, right)
; @ REPL[4]:1 within `do_dispatch`
define i64 @julia_do_dispatch_332() #0 {
top:
ret i64 2
}
julia> @code_llvm do_typeof(left, right)
; @ REPL[7]:1 within `do_typeof`
define i64 @julia_do_typeof_341() #0 {
top:
ret i64 2
}
Note that items is of type Vector{Any}. With a narrower element type, both paths are faster, and the dispatch code is equally fast. (This holds only for sufficiently small unions.)
I believe the reason dynamic dispatch is slower is that the code compiled for Vector{Any} cannot exploit the knowledge that do_dispatch has exactly 4 methods. This may change (e.g. you could define a struct S3), and I don’t think the compiled code will be discarded when that happens.
Whereas the code for Vector{Union{S1, S2}} is going to be essentially the same branches you wrote, probably inlining do_dispatch. This code can never encounter some new S3, and I believe it will be thrown away if you change the definitions of do_dispatch it uses.
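As a quick way to see that element-type difference (a minimal sketch; the variable names v_any and v_union are mine, not from the thread):

```julia
struct S1 end
struct S2 end

# A plain literal [S1(), S2()] promotes its element type all the way to Any,
# so rand over it returns a Vector{Any}:
v_any = rand([S1(), S2()], 10)
println(typeof(v_any))    # Vector{Any}

# Annotating the literal keeps the small union, which the compiler can
# union-split into essentially the same branches as the typeof version:
v_union = rand(Union{S1,S2}[S1(), S2()], 10)
println(typeof(v_union))  # Vector{Union{S1, S2}}
```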
julia> function test(size)
println("Size=", size)
items = rand(Union{S1,S2}[S1(), S2()], size)
println("- If typeof")
@time println(sum(do_typeof(x, y) for x in items, y in items))
println("- Multiple dispatch")
@time println(sum(do_dispatch(x, y) for x in items, y in items))
println("- If typeof")
@time println(sum(do_typeof(x, y) for x in items, y in items))
println("- Multiple dispatch")
@time println(sum(do_dispatch(x, y) for x in items, y in items))
end
test (generic function with 1 method)
julia> test(1000)
Size=1000
- If typeof
2491000
0.002022 seconds (8 allocations: 232 bytes)
- Multiple dispatch
2491000
0.001944 seconds (8 allocations: 232 bytes)
- If typeof
2491000
0.002671 seconds (8 allocations: 232 bytes)
- Multiple dispatch
2491000
0.001947 seconds (8 allocations: 232 bytes)
julia> test(20000)
Size=20000
- If typeof
994060000
1.531119 seconds (9 allocations: 544 bytes)
- Multiple dispatch
994060000
1.514052 seconds (8 allocations: 240 bytes)
- If typeof
994060000
1.585021 seconds (8 allocations: 240 bytes)
- Multiple dispatch
994060000
1.542702 seconds (8 allocations: 240 bytes)
Make sure to do a warm up run on test or use BenchmarkTools.
I just want to say that @hlbnet’s original benchmarks are perfectly sound, reproducible, included both a warmup and second evaluation, and documented the proportion of time spent on compilation (which happened to be negligible).
Thank you @mcabbott for providing an accurate answer, and apologies to @hlbnet for the community’s misplaced criticism of your benchmarks. We get a lot of new users who report strange benchmark results due to poor benchmarking and it’s an understandable though unfortunate mistake to lump you in with those folks.
My mistake, I didn’t hover over the code, so I didn’t see the scroll bar and the test function…
Also, I perhaps missed the point a little bit, if the idea was to test dynamic dispatch where the compiler doesn’t know in advance which function will be called. As mcabbott noticed, the dispatch version slows down and allocates a lot if the vector type doesn’t contain any information on the element types. However, the version using conditions is not affected. It is not at all obvious why this should be the case. Why can’t the dispatch system do something equivalent to the typeof tests to avoid the allocations and the dramatic slowdown?
So although a simple fix exists (using a narrower element type for the vector), the question remains why the dispatch version does so poorly compared to the reference in this particular case.
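For what it’s worth, the compiler’s union-splitting can be imitated by hand, even over a Vector{Any}: after an isa test, the arguments have concrete types inside each branch, so every do_dispatch call below is resolved statically. (total_split is a hypothetical helper of mine, not code from the thread; it assumes only S1/S2 values ever appear.)

```julia
struct S1 end
struct S2 end
do_dispatch(::S1, ::S1) = 1
do_dispatch(::S1, ::S2) = 2
do_dispatch(::S2, ::S1) = 3
do_dispatch(::S2, ::S2) = 4

# Manual "union splitting": inside each branch the compiler knows the
# concrete types of x and y, so the calls no longer go through the
# dynamic-dispatch machinery.
function total_split(items)
    s = 0
    for x in items, y in items
        if x isa S1 && y isa S1
            s += do_dispatch(x, y)
        elseif x isa S1 && y isa S2
            s += do_dispatch(x, y)
        elseif x isa S2 && y isa S1
            s += do_dispatch(x, y)
        elseif x isa S2 && y isa S2
            s += do_dispatch(x, y)
        end
    end
    return s
end

items = Any[S1(), S2(), S1()]
println(total_split(items))   # 18
```

This is essentially what the compiler does automatically for a Vector{Union{S1,S2}}, which is why the narrower element type fixes the benchmark.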
(EDIT: as shown below, this seems to be a false result, because the compiler optimized away the loop when I didn’t print out the result)
There are also other factors affecting the performance here… In my experience, generator expressions and closures don’t always work as you would hope performance-wise, and the allocations suggested something might be going on. So I ran a test using an old-fashioned for loop and got another order of magnitude reduction in run time:
function test(size)
println("Size=", size)
items = rand(Union{S1,S2}[S1(), S2()], size) # Note the narrower type
println("- If typeof")
@time println(sum(do_typeof(x, y) for x in items, y in items))
println("- Multiple dispatch")
@time println(sum(do_dispatch(x, y) for x in items, y in items))
end
function test_with_for(size)
println("Size=", size)
items = rand(Union{S1,S2}[S1(), S2()], size)
println("- If typeof")
s = 0
@time for x in items, y in items
s += do_typeof(x, y)
end
println("- Multiple dispatch")
s = 0
@time for x in items, y in items
s += do_dispatch(x, y)
end
end
# Compile
test(20000)
test_with_for(20000)
julia> test(20000)
Size=20000
- If typeof
997900000
1.458255 seconds (9 allocations: 544 bytes)
- Multiple dispatch
997900000
1.488013 seconds (8 allocations: 240 bytes)
julia> test_with_for(20000)
Size=20000
- If typeof
0.212348 seconds
- Multiple dispatch
0.211795 seconds
I have replaced the following line:
items = rand([S1(), S2()], size)
By this one:
items = rand(Union{S1,S2}[S1(), S2()], size)
The test with multiple dispatch is then dramatically sped up, achieving the same performance as the typeof test.
That is indeed very interesting and informative. Note that I was doing this experiment just because I am trying to learn Julia. Be sure I will remember this lesson for a long time!
Tests with classical loops are faster, but not as much as in @tverho’s example.
Still trying to understand why …
Here is my new code at this moment:
struct S1 end
struct S2 end
do_dispatch(left::S1, right::S1) = 1
do_dispatch(left::S1, right::S2) = 2
do_dispatch(left::S2, right::S1) = 3
do_dispatch(left::S2, right::S2) = 4
function do_typeof(left, right)
if typeof(left) === S1
if typeof(right) === S1
return 1
elseif typeof(right) === S2
return 2
end
elseif typeof(left) === S2
if typeof(right) === S1
return 3
elseif typeof(right) === S2
return 4
end
end
end
function test(size)
println("Size=", size)
items = rand(Union{S1,S2}[S1(), S2()], size)
println("- If typeof (with generator)")
@time s = sum(do_typeof(x, y) for x in items, y in items)
println(" sum=", s)
println("- If typeof (classical loop)")
s = 0
@time for x in items, y in items
s += do_typeof(x, y)
end
println(" sum=", s)
println("- Multiple dispatch (with generator)")
@time s = sum(do_dispatch(x, y) for x in items, y in items)
println(" sum=", s)
println("- Multiple dispatch (classical loop)")
s = 0
@time for x in items, y in items
s += do_dispatch(x, y)
end
println(" sum=", s)
println("- If typeof (with generator)")
@time s = sum(do_typeof(x, y) for x in items, y in items)
println(" sum=", s)
println("- If typeof (classical loop)")
s = 0
@time for x in items, y in items
s += do_typeof(x, y)
end
println(" sum=", s)
println("- Multiple dispatch (with generator)")
@time s = sum(do_dispatch(x, y) for x in items, y in items)
println(" sum=", s)
println("- Multiple dispatch (classical loop)")
s = 0
@time for x in items, y in items
s += do_dispatch(x, y)
end
println(" sum=", s)
end
test(20000)
Oops, when I omitted the print statement, the compiler seemed to optimize away the whole loop (or something). When I added the print statement, I got the same result as you. So never mind :).
typeof(x) gives us a pointer to the type. This is just a memory load, because Julia objects store a pointer to their type in their header.
Nice! This means that type comparisons are fast (pointer compare).
In order to figure out where a call goes, we need to collect all the arguments to the method, and then look that up by a complicated tree search.
No, just kidding. There is a small hashmap-cache in front of the tree search. So we just need to look into that hashmap. In trivial benchmarks (long loop, few dynamic dispatches) the cache is always hot, i.e. we immediately find the right entry. (So your benchmark is quite optimistic in that sense)
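As a toy model of that fast path (purely illustrative; the real cache lives in C inside jl_apply_generic, and slow_lookup here merely stands in for the method-table search):

```julia
struct S1 end
struct S2 end

# Stand-in for the full method-table search ("the complicated tree search"):
# given the argument types, return the matching method body.
function slow_lookup(key)
    key == (S1, S1) && return (l, r) -> 1
    key == (S1, S2) && return (l, r) -> 2
    key == (S2, S1) && return (l, r) -> 3
    return (l, r) -> 4
end

# The small hashmap cache sitting in front of it, keyed by argument types.
const TOY_CACHE = Dict{Tuple{DataType,DataType},Function}()

function toy_dispatch(left, right)
    key = (typeof(left), typeof(right))
    f = get!(TOY_CACHE, key) do
        slow_lookup(key)    # cache miss: do the expensive search once
    end
    return f(left, right)   # later calls with these types hit the cache
end

println(toy_dispatch(S1(), S2()))  # 2 (cache miss, then stored)
println(toy_dispatch(S1(), S2()))  # 2 (served from the cache)
```

In a tight benchmark loop over the same few types, the cache is always hot, which is why even dynamic dispatch stays in the nanosecond range per call.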
Alas, the entire code for this is written in C and is part of libjulia. You can look at it here.
Now ideally someone would sit down and intrinsify the fast path of jl_apply_generic, such that LLVM can optimize through it (and e.g. use the fact that world age, arity, and address are compile-time known constants).
It appears that nobody with the right skill set really cares about shaving a couple of nanoseconds off dynamic multiple dispatch, so it doesn’t get faster (same as e.g. push! to a vector).
edit: You might also ask: Shouldn’t the JIT see the right dispatch target on the first call? Then it should be able to emit code for a fastpath that checks whether the type is as expected, and otherwise calls into the slow thing.
Alas, nope. Julia is not really a JIT. We don’t support tiered compilation or profile-guided optimization / tracing with guards that deoptimize on surprises. These would be awesome features, but I don’t see it happening (LLVM purports to support something in that direction, but AFAIU it’s pretty half-hearted).
Regarding push! to a vector, I saw news recently that efforts to rewrite Vector based on a low-level Memory type are underway, see This month in Julia world - 2023-10