concr
0.000758 seconds
abstr
0.027950 seconds (999.49 k allocations: 15.251 MiB)
manual dispatch
0.007648 seconds
union
0.008517 seconds
(sum=500000500000)
And the output of Julia 0.6.1:
concr
0.000535 seconds
abstr
0.030308 seconds (999.49 k allocations: 15.251 MiB)
manual dispatch
0.008486 seconds
union
0.037312 seconds (999.49 k allocations: 15.251 MiB, 24.76% gc time)
(sum=500000500000)
Legend:
concr means no dispatch (concrete type known at compile time)
abstr means dispatching from an abstract (base) type reference
manual means dispatching “by hand” using if/else blocks over the return of typeof (sketched just below this list)
union means dispatching from a union of the two concrete types
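For reference, the manual case is essentially the following (a sketch using the benchmark’s two concrete types, not the exact code):

# “manual dispatch”: branch on typeof instead of letting Julia pick the method
function f_manual(a)
    if typeof(a) === ConcrType1
        return (a::ConcrType1).x  # the assertion lets the compiler use the concrete layout
    elseif typeof(a) === ConcrType2
        return (a::ConcrType2).x
    else
        error("unexpected type")
    end
end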
I’m very happy with the times for the Union case! I think we can say it is almost on par with C++ vtable dispatch.
Update: These times should be taken with a (big!) grain of salt. See the replies below.
Update 2: These times do not measure dispatch (at the CPU instruction level) alone. They measure a possible algorithm structure using derived types, in more or less idiomatic approaches, in the two languages. As with almost all benchmarks, many things count towards the timings: compiler optimizations (inlining), cache hits, branch prediction, etc.
Do you have any ideas why the concr case got slower between v0.6.1 and v0.7 master?
A 42% slowdown, in what is the most common case, is rather worrying.
I’ll try running your code on my system, see if I can figure anything out.
This was very interesting information about C++ vs. Julia performance, good work!
I would suggest using the BenchmarkTools package to get more consistent timings, especially when timing fast things like the concrete dispatch. It should be sufficient to do:
using BenchmarkTools

abstract type AbstrType end
struct ConcrType1 <: AbstrType; x::Int; end
struct ConcrType2 <: AbstrType; x::Int; end
f(a) = a.x
n = 1_000_000
@btime sum(f, $([ConcrType1(i) for i=1:n]));
@btime sum(f, $(AbstrType[rand(Bool) ? ConcrType1(i) : ConcrType2(i) for i=1:n]));
@btime sum(f, $(Union{ConcrType1, ConcrType2}[rand(Bool) ? ConcrType1(i) : ConcrType2(i) for i=1:n]));
to time the three main cases (concrete, abstract, union). You don’t need to put the code into a function if you splice in non-const arguments with $, and you don’t need to write multiple f methods by hand, since specialized versions are compiled automatically.
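The $ matters because @btime would otherwise also measure access to a non-const global; compare:

arr = [ConcrType1(i) for i=1:n]  # non-const global
@btime sum(f, arr)   # also times the untyped global access to arr
@btime sum(f, $arr)  # splices the value in, so only the sum is timed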
I did a little bit of playing with this benchmark.
So, first note that the concrete case depends a lot on inlining. I think that C++ is forbidden from inlining virtual functions; once you add @noinline, the julia and C++ timings for concrete become comparable.
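That is, something like:

@noinline f(a) = a.x  # keep the call from being inlined away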
Second, you don’t really measure dispatch speed; you also have differing modes of array access (direct and indirect). This can be relieved by making your types mutable.
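For instance (illustrative definitions, not from the benchmark):

struct ImmElem; x::Int; end          # a Vector{ImmElem} stores the values inline
mutable struct MutElem; x::Int; end  # a Vector{MutElem} stores references, like the C++ pointer case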
Third, I think that the timings are extremely unreliable, in the sense of depending very strongly on the state of the allocator on creation. Compare
using Random

abstract type AbstrType end
mutable struct ConcrType1 <: AbstrType; x::Int; end  # mutable, so the array stores references
mutable struct ConcrType2 <: AbstrType; x::Int; end
f(a) = a.x
n = 10_000_000
GC.gc()
const a1 = [ConcrType1(i) for i=1:n]  # objects allocated contiguously, in order
const a2 = [ConcrType1(i) for i=1:n]
shuffle!(a2)  # same objects, but a randomized memory-access order
sum(f, a1)
@time sum(f, a1)
@time sum(f, a2)
@time sum(f, a1)
@time sum(f, a2)
Can that be related to CPU branch prediction?
I use randomness when selecting the derived type of the object to try to avoid branch prediction, but maybe I should also do it for the array values ordering…
I’m getting very different results when running the code from the command line (vs. the REPL)…
As some have said, these times must be taken with a grain of salt.
Well, a3 was built on the REPL. This slowed down construction and led to extra allocations (probably because the integer i is boxed?), which led to a less contiguous memory layout. This made a3 a “reliably bad” array in terms of cache behavior. a1 was pretty good, and a2 is “maximally bad”.
[edit: a3 was not built on the REPL, but at top-level scope. Same problem.]
I see… You are right.
But I’m not being too unfair to C++ (except for the inlining part, which I’ll correct), since in the concrete case it is a vector of values and in the abstract case a vector of pointers.
But I agree, this is not measuring the dispatch alone; it is measuring a lot of stuff (which I initially put in to try to defeat branch prediction…)
I see. Lots of cache effects arise when using arrays for the benchmark…
Re inlining: I am unsure what a fair comparison would be.
I think that automatic inlining of small virtual functions is a really cool feature of julia over C++, and the devs deserve to brag about it (but I am not so good at C++, so take this with a grain of salt). However, the price of it is not benchmarked here: code duplication, because julia forgoes binary compatibility of subtypes.
If I subclass in C++ and decide not to override a virtual function (or the function was not virtual anyway), then no new code is generated. This is because subclasses are binary compatible: if B is a subclass of A, then functions compiled for A can seamlessly operate on instances of B.
As far as I know there is no way to tell Julia “this function can operate on the following types, but please compile code only once, they are binary compatible (for the purpose of this function)”.
Now, the C++ way is used a lot in practice: I subtype something from a library without needing or compiling the library’s source code. I only need a header (or to reverse-engineer the data layout), and the linker will find the methods for me. Nifty!
I think this does not work in julia, and as far as I gathered was never a design goal. This is not a problem for me, but you have to admit that this is pretty cool.
Edit: OK, using union types it might (?) be possible to tell julia this. But that is not the point: the author of the original method knows nothing of my new type, and I, as the later type-author, declare that methods applying to type A also apply to my new type, without recompilation.
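The closest thing I know of is @nospecialize, which asks julia to compile one generic method body instead of a specialization per argument type — that addresses the code duplication, though it is not binary-compatible subtyping (a sketch):

# one generic compilation shared by all argument types
# (field access goes through the slower generic path)
g(@nospecialize(a)) = a.x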
Here’s the result for Julia 1.5.3 running the above benchmark SingleDispatchJuliaVsCXX, in case anyone wants to have a look. (I know this is an old thread, but I got interested in comparing Julia’s dispatch against C++, which uses a vtable under the hood, and Google directed me here.)
concr
1.576 ms (0 allocations: 0 bytes)
abstr
43.943 ms (1999458 allocations: 30.51 MiB)
manual dispatch
11.056 ms (0 allocations: 0 bytes)
union
5.745 ms (0 allocations: 0 bytes)
The C++ result is
concr
0.001452 seconds
abstr
0.007142 seconds
Type unions look like the most practical and performant solution, but Julia’s poor timing for abstract inheritance seems to suggest that the paradigm of multiple dispatch does involve some performance trade-off for single dispatch.
I think in order for the julia code to be more comparable to the C++ code you should define two methods for f on ConcrType1 and ConcrType2, in addition to specifying the return type (which removes a type instability on the variable summed over):
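Something along these lines (a sketch, mirroring the @noinline pattern used below):

# hypothetical reconstruction of the suggested per-type definitions
@noinline f(a::ConcrType1)::Int = a.x
@noinline f(a::ConcrType2)::Int = a.x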
One thing to keep in mind, though, is that adding more methods doesn’t scale, performance-wise. For example, adding a third method (which was missing from the above, and is needed to actually make it comparable to the C++ code):
@noinline f(a::AbstrType)::Int = a.x
does not change the timings, but adding two more, say on a third & fourth type (without actually changing the arrays themselves):
struct ConcrType3 <: AbstrType; x::Int; end
struct ConcrType4 <: AbstrType; x::Int; end
@noinline f(a::ConcrType3)::Int = a.x
@noinline f(a::ConcrType4)::Int = a.x
and the compiler gives up.
Even though we have typed all four methods to return Int, julia will infer the return value as Any. At this point you can put a type annotation on the return of f at the call site, to ensure the accumulator is type-stable. This doesn’t help dispatch, however.
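For instance (a sketch, where abstr_arr stands in for the abstract-element array):

@btime sum(a -> f(a)::Int, $abstr_arr)  # the ::Int assertion keeps the accumulator type-stable

The timings then become: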
## concrete array (baseline):
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.217 ms (0.00% GC)
median time: 1.764 ms (0.00% GC)
mean time: 1.531 ms (0.00% GC)
maximum time: 3.194 ms (0.00% GC)
--------------
samples: 3257
evals/sample: 1
## auto dispatch on abstract array:
BenchmarkTools.Trial:
memory estimate: 15.25 MiB
allocs estimate: 999489
--------------
minimum time: 29.708 ms (0.00% GC)
median time: 32.189 ms (0.00% GC)
mean time: 33.763 ms (5.86% GC)
maximum time: 58.111 ms (9.83% GC)
--------------
samples: 149
evals/sample: 1
## auto dispatch on union array:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.453 ms (0.00% GC)
median time: 5.500 ms (0.00% GC)
mean time: 5.598 ms (0.00% GC)
maximum time: 7.433 ms (0.00% GC)
--------------
samples: 894
evals/sample: 1
## manual dispatch on abstract array:
BenchmarkTools.Trial:
memory estimate: 30.51 MiB
allocs estimate: 1999458
--------------
minimum time: 33.954 ms (0.00% GC)
median time: 40.313 ms (14.11% GC)
mean time: 39.675 ms (10.19% GC)
maximum time: 52.661 ms (12.24% GC)
--------------
samples: 127
evals/sample: 1
## manual dispatch on union array:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.914 ms (0.00% GC)
median time: 6.048 ms (0.00% GC)
mean time: 6.727 ms (0.00% GC)
maximum time: 32.357 ms (0.00% GC)
--------------
samples: 743
evals/sample: 1