Update on single dispatch benchmark comparison to C++

cxx
benchmark

#1

Hi,

I’m making an update of the benchmarks I ran back in 2015, comparing the performance of Julia dispatch on a single argument against C++ dispatch: https://groups.google.com/forum/#!topic/julia-users/pTywc8Lcd28
The code is in here: https://github.com/cdsousa/SingleDispatchJuliaVsCXX.

Output of C++:

concr
0.001477 seconds
abstr
0.007114 seconds
(sum=500000500000)

Output of Julia Nightly (0.7.0-DEV.2587):

concr
  0.000758 seconds
abstr
  0.027950 seconds (999.49 k allocations: 15.251 MiB)
manual dispatch
  0.007648 seconds
union
  0.008517 seconds
(sum=500000500000)

And output of Julia 0.6.1:

concr
  0.000535 seconds
abstr
  0.030308 seconds (999.49 k allocations: 15.251 MiB)
manual dispatch
  0.008486 seconds
union
  0.037312 seconds (999.49 k allocations: 15.251 MiB, 24.76% gc time)
(sum=500000500000)

Legend:

  • concr means no dispatch (concrete type known at compile time)
  • abstr means dispatching from an abstract (base) type reference
  • manual means dispatching “by hand” using if/else blocks on the result of typeof
  • union means dispatching from a union of the two concrete types
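
For concreteness, the “manual” variant can be sketched like this (a hedged illustration, not the exact benchmark code; the type and function names just mirror the snippets later in the thread):

```julia
# Illustrative sketch of "manual" dispatch: branching on typeof by hand
# instead of letting Julia's method dispatch resolve the concrete type.
abstract type AbstrType end
struct ConcrType1 <: AbstrType; x::Int; end
struct ConcrType2 <: AbstrType; x::Int; end

f(a::ConcrType1) = a.x
f(a::ConcrType2) = a.x

function manual_sum(v)
    s = 0
    for a in v
        if typeof(a) === ConcrType1
            s += f(a::ConcrType1)  # the branch pins down the concrete type
        else
            s += f(a::ConcrType2)
        end
    end
    return s
end
```

Inside each branch the compiler knows the concrete type, so the calls can be devirtualized, which is why this variant avoids the allocations seen in the abstract case.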

I’m very happy with the times for the Union case! I think we can say it is almost on par with C++ vtable dispatch.

Update: These times should be taken with a (big) grain of salt. See the replies below.

Update 2: These times do not measure dispatching (at the CPU instruction level) alone. They measure a possible algorithm structure using derived types in more or less idiomatic approaches in the two languages. As with almost all benchmarks, a lot of things count toward the timings: compiler optimizations (inlining), cache hits, branch prediction, etc.

Update 3: See some new measures in replies below.


#3

Do you have any ideas why the concr case got slower between v0.6.1 and v0.7 master?
A 42% slowdown, in what is the most common case, is rather worrying.


#4

Those times vary by 10 or 20%, but even so, the difference seems meaningful.
I have no clue as to the reason for it, though.


#5

I’ll try running your code on my system and see if I can figure anything out.
This was very interesting information about C++ vs. Julia performance, good work! :+1:


#6

I would suggest using the BenchmarkTools package to get more consistent timings, especially when timing fast things like the concrete dispatch. It should be sufficient to do:

abstract type AbstrType end
struct ConcrType1 <: AbstrType; x::Int; end
struct ConcrType2 <: AbstrType; x::Int; end
f(a) = a.x
n = 1_000_000
@btime sum(f, $([ConcrType1(i) for i=1:n]));
@btime sum(f, $(AbstrType[rand(Bool) ? ConcrType1(i) : ConcrType2(i) for i=1:n]));
@btime sum(f, $(Union{ConcrType1, ConcrType2}[rand(Bool) ? ConcrType1(i) : ConcrType2(i) for i=1:n]));

to time the three main cases (concrete, abstract, union). You don’t need to put it into a function if you splice in non-const arguments with $, and you don’t need to manually write multiple f methods, since specialized versions will be created automatically by the compiler.


#7

I agree @stevengj . Plus, your code is elegantly compact! :smiley:

I’ll probably update the code, but I’ll also keep the naive version, since that is the way I’m doing it in C++.


#8

I did a little bit of playing with this benchmark.

So, first note that the concrete case depends a lot on inlining. I think that C++ is effectively forbidden from inlining virtual functions; once you add @noinline, the Julia and C++ timings for the concrete case become comparable.
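
As a sketch of the @noinline point (assuming the same toy types used elsewhere in the thread):

```julia
abstract type AbstrType end
struct ConcrType1 <: AbstrType; x::Int; end

# Without @noinline, Julia inlines this tiny getter and the "call"
# vanishes from the loop; with it, every element pays an actual function
# call, closer to what a C++ virtual call costs.
@noinline f(a) = a.x

v = [ConcrType1(i) for i in 1:1_000_000]
sum(f, v)
```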

Second, you don’t really measure dispatch speed alone; you also have differing modes of array access (direct vs. indirect). This can be alleviated by making your types mutable.

Third, I think that the timings are extremely unreliable, in the sense of depending very strongly on the state of the allocator on creation. Compare

abstract type AbstrType end
type ConcrType1 <: AbstrType; x::Int; end
type ConcrType2 <: AbstrType; x::Int; end
f(a) = a.x
n = 10_000_000
gc()
const a1 = Vector{ConcrType1}([ConcrType1(i) for i=1:n])
const a2 =  Vector{ConcrType1}([ConcrType1(i) for i=1:n])
shuffle!(a2)

sum(f, a1)
@time sum(f, a1)
@time sum(f, a2)

@time sum(f, a1)
@time sum(f, a2)

for an output of

  0.029130 seconds (5 allocations: 176 bytes)
  0.157367 seconds (5 allocations: 176 bytes)
  0.027281 seconds (5 allocations: 176 bytes)
  0.157395 seconds (5 allocations: 176 bytes)

#9

Can that be related to CPU branch prediction?
I use randomness when selecting the derived type of each object to try to defeat branch prediction, but maybe I should also randomize the ordering of the array values…


#10

I’m getting very different results when running the code from the command line (vs. the REPL)…
As some have said, these times must be taken with a grain of salt.


#11

Regarding memory allocation: Consider

abstract type AbstrType end
type ConcrType1 <: AbstrType; x::Int; end
type ConcrType2 <: AbstrType; x::Int; end
f(a) = a.x
n = 10_000_000


gc()
@time const a1 = Vector{ConcrType1}([ConcrType1(i) for i=1:n])
@time const a2 =  Vector{ConcrType1}([ConcrType1(i) for i=1:n])
@time shuffle!(a2)
a3 = Vector{ConcrType1}()
@time for i=1:n
    push!(a3, ConcrType1(i) )
end

function main(a1,a2,a3)
    println("a1")
    @time sum(f, a1)
    println("a2")
    @time sum(f, a2)
    println("a3")
    @time sum(f, a3)
    println("------")
end

main(a1,a2,a3)
main(a1,a2,a3)
main(a1,a2,a3)

For an output of

$ julia jtest2.jl
  0.550571 seconds (10.01 M allocations: 229.755 MiB, 62.68% gc time)
  0.545317 seconds (10.00 M allocations: 228.882 MiB, 69.09% gc time)
  0.449771 seconds (6.95 k allocations: 387.315 KiB)
 40.855672 seconds (40.00 M allocations: 891.983 MiB, 6.86% gc time)
a1
  0.023915 seconds
a2
  0.157738 seconds
a3
  0.060393 seconds
------
a1
  0.023734 seconds
a2
  0.156797 seconds
a3
  0.060907 seconds
------
a1
  0.023627 seconds
a2
  0.156888 seconds
a3
  0.060060 seconds
------

Ok, what happened between a1 and a3?

Well, a3 was built in the REPL. This slowed down construction and led to extra allocations (probably because the integer i is boxed?), which led to a less contiguous memory layout. This made a3 a “reliably bad” array in terms of cache behavior. a1 is pretty good, and a2 is “maximally bad”.

[edit: a3 was not built in the REPL, but in top-level scope. Same problem.]


#12

I see… You are right.
But I’m not being too unfair to C++ (except for the inlining part, which I’ll correct), as for the concrete case it is a vector of values and for the abstract case it is a vector of pointers.
But I agree, this is not measuring the dispatch alone; it’s measuring a lot of stuff (which I initially put in to try to defeat branch prediction…).

I see. Lots of cache influence arises when using arrays for the benchmark…


#13

Times for a new version taking into account some comments above.

Output of C++:

concr
0.001477 seconds
abstr
0.007114 seconds

Output of Julia Nightly (0.7.0-DEV.2587):

concr
  1.556 ms (0 allocations: 0 bytes)
abstr
  43.301 ms (1999458 allocations: 30.51 MiB)
manual dispatch
  8.896 ms (0 allocations: 0 bytes)
union
  8.632 ms (0 allocations: 0 bytes)

Output of Julia 0.6.1:

concr
  1.573 ms (0 allocations: 0 bytes)
abstr
  61.823 ms (1999458 allocations: 30.51 MiB)
manual dispatch
  9.297 ms (0 allocations: 0 bytes)
union
  19.702 ms (999489 allocations: 15.25 MiB)

#14

It would be nice to see a graph with the evolution of the speed through different Julia versions.


#15

Re inlining: I am unsure what a fair comparison would be.

I think that automatic inlining of small virtual functions is a really cool feature of Julia over C++, and the devs deserve to brag about it (but I am not so good at C++, so take this with a grain of salt). However, the price of it is not benchmarked here: code duplication, because Julia forgoes binary compatibility of subtypes.

Re Juan: I agree, such a graph would be cool.


#16

I’m not sure what this means. Can you clarify?


#17

Sure, but please correct me if I’m wrong.

If I subclass in C++ and decide not to override a virtual function (or the function was not virtual anyway), then no new code is generated. This is because subclasses are binary compatible: if B is a subclass of A, then functions compiled for A can seamlessly operate on instances of B.

As far as I know, there is no way to tell Julia “this function can operate on the following types, but please compile the code only once; they are binary compatible (for the purposes of this function)”.

Now, the C++ way is used a lot in practice: I subtype something from a library without needing or compiling the library’s source code. I only need a header (or to reverse-engineer the data layout), and the linker will find the methods for me. Nifty!

I think this does not work in Julia, and as far as I can gather it was never a design goal. That is not a problem for me, but you have to admit it is pretty cool.

Edit: OK, using Union types it might (?) be possible to tell Julia this. But that is not the point: the author of the original method knows nothing of my new type, and I, as the later type author, declare that methods applying to type A also apply to my new type, without recompilation.
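
A minimal sketch of what that Union annotation could look like (hypothetical; whether Julia compiles one shared body or still specializes per concrete type is exactly the open question here):

```julia
abstract type AbstrType end
struct ConcrType1 <: AbstrType; x::Int; end
struct ConcrType2 <: AbstrType; x::Int; end

# One method declared for a Union of the two concrete types. Note the
# method author still has to enumerate every type up front, unlike the
# C++ scheme where a later subtype reuses the already-compiled code.
f(a::Union{ConcrType1, ConcrType2}) = a.x
```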


#18