Update on single dispatch benchmark comparison to C++

cxx
benchmark

#1

Hi,

I’m making an update of the benchmarks I ran back in 2015, comparing the performance of Julia dispatch on a single argument against C++ dispatch: https://groups.google.com/forum/#!topic/julia-users/pTywc8Lcd28
The code is in here: https://github.com/cdsousa/SingleDispatchJuliaVsCXX.

Output of C++:

concr
0.001477 seconds
abstr
0.007114 seconds
(sum=500000500000)

Output of Julia Nightly (0.7.0-DEV.2587):

concr
  0.000758 seconds
abstr
  0.027950 seconds (999.49 k allocations: 15.251 MiB)
manual dispatch
  0.007648 seconds
union
  0.008517 seconds
(sum=500000500000)

And output of Julia 0.6.1:

concr
  0.000535 seconds
abstr
  0.030308 seconds (999.49 k allocations: 15.251 MiB)
manual dispatch
  0.008486 seconds
union
  0.037312 seconds (999.49 k allocations: 15.251 MiB, 24.76% gc time)
(sum=500000500000)

Legend:

  • concr means no dispatch (concrete type known at compile time)
  • abstr means dispatching from an abstract (base) type reference
  • manual means dispatching “by hand” using if/else blocks on the result of typeof
  • union means dispatching from a union of the two concrete types
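
For concreteness, the “manual” variant can be sketched like this (a hedged illustration, not the exact benchmark code; the type and function names just mirror the snippets later in the thread):

```julia
# Illustrative sketch of "manual" dispatch: branching on typeof by hand
# instead of letting Julia's method dispatch resolve the concrete type.
abstract type AbstrType end
struct ConcrType1 <: AbstrType; x::Int; end
struct ConcrType2 <: AbstrType; x::Int; end

f(a::ConcrType1) = a.x
f(a::ConcrType2) = a.x

function manual_sum(v)
    s = 0
    for a in v
        if typeof(a) === ConcrType1
            s += f(a::ConcrType1)  # the branch pins down the concrete type
        else
            s += f(a::ConcrType2)
        end
    end
    return s
end
```

Inside each branch the compiler knows the concrete type, so the calls can be devirtualized, which is why this variant avoids the allocations seen in the abstract case.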

I’m very happy with the times for the Union case! I think we can say it is almost on par with C++ vtable dispatch.

Update: These times should be taken with a (big) grain of salt. See the replies below.

Update 2: These times do not measure dispatching (at the CPU instruction level) alone. They measure a possible algorithm structure using derived types in more or less idiomatic approaches in the two languages. As with almost all benchmarks, a lot of things count toward the timings: compiler optimizations (inlining), cache hits, branch prediction, etc.

Update 3: See some new measures in replies below.


#3

Do you have any ideas why the concr case got slower between v0.6.1 and v0.7 master?
A 42% slowdown, in what is the most common case, is rather worrying.


#4

Those times vary by 10 or 20%, but even so, the difference seems meaningful.
I have no clue as to the reason for it, though.


#5

I’ll try running your code on my system and see if I can figure anything out.
This was very interesting information about C++ vs. Julia performance, good work! :+1:


#6

I would suggest using the BenchmarkTools package to get more consistent timings, especially when timing fast things like the concrete dispatch. It should be sufficient to do:

abstract type AbstrType end
struct ConcrType1 <: AbstrType; x::Int; end
struct ConcrType2 <: AbstrType; x::Int; end
f(a) = a.x
n = 1_000_000
@btime sum(f, $([ConcrType1(i) for i=1:n]));
@btime sum(f, $(AbstrType[rand(Bool) ? ConcrType1(i) : ConcrType2(i) for i=1:n]));
@btime sum(f, $(Union{ConcrType1, ConcrType2}[rand(Bool) ? ConcrType1(i) : ConcrType2(i) for i=1:n]));

to time the three main cases (concrete, abstract, union). You don’t need to put it into a function if you splice in non-const arguments with $, and you don’t need to manually write multiple f methods, since specialized versions will be created automatically by the compiler.


#7

I agree @stevengj . Plus, your code is elegantly compact! :smiley:

I’ll probably update the code, but I’ll also keep the naive version, since that is the way I’m doing it in C++.


#8

I did a little bit of playing with this benchmark.

So, first note that the concrete case depends a lot on inlining. I think that C++ is effectively forbidden from inlining virtual functions; once you add @noinline, the Julia and C++ timings for the concrete case become comparable.
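
As a sketch of the @noinline point (assuming the same toy types used elsewhere in the thread):

```julia
abstract type AbstrType end
struct ConcrType1 <: AbstrType; x::Int; end

# Without @noinline, Julia inlines this tiny getter and the "call"
# vanishes from the loop; with it, every element pays an actual function
# call, closer to what a C++ virtual call costs.
@noinline f(a) = a.x

v = [ConcrType1(i) for i in 1:1_000_000]
sum(f, v)
```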

Second, you don’t really measure dispatch speed alone; you also have differing modes of array access (direct vs. indirect). This can be alleviated by making your types mutable.

Third, I think that the timings are extremely unreliable, in the sense of depending very strongly on the state of the allocator on creation. Compare

abstract type AbstrType end
type ConcrType1 <: AbstrType; x::Int; end
type ConcrType2 <: AbstrType; x::Int; end
f(a) = a.x
n = 10_000_000
gc()
const a1 = Vector{ConcrType1}([ConcrType1(i) for i=1:n])
const a2 =  Vector{ConcrType1}([ConcrType1(i) for i=1:n])
shuffle!(a2)

sum(f, a1)
@time sum(f, a1)
@time sum(f, a2)

@time sum(f, a1)
@time sum(f, a2)

for an output of

  0.029130 seconds (5 allocations: 176 bytes)
  0.157367 seconds (5 allocations: 176 bytes)
  0.027281 seconds (5 allocations: 176 bytes)
  0.157395 seconds (5 allocations: 176 bytes)

#9

Can that be related to CPU branch prediction?
I use randomness when selecting the derived type of each object to try to defeat branch prediction, but maybe I should also randomize the ordering of the array values…


#10

I’m getting very different results when running the code from the command line (vs. the REPL)…
As some have said, these times must be taken with a grain of salt.


#11

Regarding memory allocation: Consider

abstract type AbstrType end
type ConcrType1 <: AbstrType; x::Int; end
type ConcrType2 <: AbstrType; x::Int; end
f(a) = a.x
n = 10_000_000


gc()
@time const a1 = Vector{ConcrType1}([ConcrType1(i) for i=1:n])
@time const a2 =  Vector{ConcrType1}([ConcrType1(i) for i=1:n])
@time shuffle!(a2)
a3 = Vector{ConcrType1}()
@time for i=1:n
    push!(a3, ConcrType1(i) )
end

function main(a1,a2,a3)
    println("a1")
    @time sum(f, a1)
    println("a2")
    @time sum(f, a2)
    println("a3")
    @time sum(f, a3)
    println("------")
end

main(a1,a2,a3)
main(a1,a2,a3)
main(a1,a2,a3)

For an output of

$ julia jtest2.jl
  0.550571 seconds (10.01 M allocations: 229.755 MiB, 62.68% gc time)
  0.545317 seconds (10.00 M allocations: 228.882 MiB, 69.09% gc time)
  0.449771 seconds (6.95 k allocations: 387.315 KiB)
 40.855672 seconds (40.00 M allocations: 891.983 MiB, 6.86% gc time)
a1
  0.023915 seconds
a2
  0.157738 seconds
a3
  0.060393 seconds
------
a1
  0.023734 seconds
a2
  0.156797 seconds
a3
  0.060907 seconds
------
a1
  0.023627 seconds
a2
  0.156888 seconds
a3
  0.060060 seconds
------

Ok, what happened between a1 and a3?

Well, a3 was built in the REPL. This slowed down construction and led to extra allocations (probably because the integer i is boxed?), which led to a less contiguous memory layout. This made a3 a “reliably bad” array in terms of cache behavior. a1 is pretty good, and a2 is “maximally bad”.

[edit: a3 was not built in the REPL, but in top-level scope. Same problem.]


#12

I see… You are right.
But I’m not being too unfair to C++ (except for the inlining part, which I’ll correct), as for the concrete case it is a vector of values and for the abstract case it is a vector of pointers.
But I agree, this is not measuring the dispatch alone; it’s measuring a lot of stuff (which I initially put in to try to defeat branch prediction…).

I see. Lots of cache influence arises when using arrays for the benchmark…


#13

Times for a new version taking into account some comments above.

Output of C++:

concr
0.001477 seconds
abstr
0.007114 seconds

Output of Julia Nightly (0.7.0-DEV.2587):

concr
  1.556 ms (0 allocations: 0 bytes)
abstr
  43.301 ms (1999458 allocations: 30.51 MiB)
manual dispatch
  8.896 ms (0 allocations: 0 bytes)
union
  8.632 ms (0 allocations: 0 bytes)

Output of Julia 0.6.1:

concr
  1.573 ms (0 allocations: 0 bytes)
abstr
  61.823 ms (1999458 allocations: 30.51 MiB)
manual dispatch
  9.297 ms (0 allocations: 0 bytes)
union
  19.702 ms (999489 allocations: 15.25 MiB)

#14

It would be nice to see a graph with the evolution of the speed through different Julia versions.


#15

Re inlining: I am unsure what a fair comparison would be.

I think that automatic inlining of small virtual functions is a really cool feature of Julia over C++, and the devs deserve to brag about it (but I am not so good at C++, so take this with a grain of salt). However, the price of it is not benchmarked here: code duplication, because Julia forgoes binary compatibility of subtypes.

Re Juan: I agree, such a graph would be cool.


#16

I’m not sure what this means. Can you clarify?


#17

Sure, but please correct me if I’m wrong.

If I subclass in C++ and decide not to override a virtual function (or the function was not virtual anyway), then no new code is generated. This is because subclasses are binary compatible: if B is a subclass of A, then functions compiled for A can seamlessly operate on instances of B.

As far as I know, there is no way to tell Julia “this function can operate on the following types, but please compile the code only once; they are binary compatible (for the purposes of this function)”.

Now, the C++ way is used a lot in practice: I subtype something from a library without needing or compiling the library’s source code. I only need a header (or to reverse-engineer the data layout), and the linker will find the methods for me. Nifty!

I think this does not work in Julia, and as far as I can gather it was never a design goal. That is not a problem for me, but you have to admit it is pretty cool.

Edit: OK, using Union types it might (?) be possible to tell Julia this. But that is not the point: the author of the original method knows nothing of my new type, and I, as the later type author, declare that methods applying to type A also apply to my new type, without recompilation.
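
A minimal sketch of what that Union annotation could look like (hypothetical; whether Julia compiles one shared body or still specializes per concrete type is exactly the open question here):

```julia
abstract type AbstrType end
struct ConcrType1 <: AbstrType; x::Int; end
struct ConcrType2 <: AbstrType; x::Int; end

# One method declared for a Union of the two concrete types. Note the
# method author still has to enumerate every type up front, unlike the
# C++ scheme where a later subtype reuses the already-compiled code.
f(a::Union{ConcrType1, ConcrType2}) = a.x
```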


#18