A few thoughts regarding these calculations.

First, annotated union splitting is redundant: once we have branched on the type of the variable, the compiler already knows the type and can figure everything out on its own. Any timing differences you see are spurious, since the generated code is the same. This is easy to check with the @code_warntype macro (I only show the relevant parts here).
julia> @code_warntype paint2(p)
...
2 ── %7 = @_3::Tuple{LineAbstract, Int64}::Tuple{LineAbstract, Int64}
│    (l = Core.getfield(%7, 1))
│    %9 = Core.getfield(%7, 2)::Int64
│    %10 = (l isa Main.Line1)::Bool
└─── goto #4 if not %10
3 ── %12 = s::Float64
│    %13 = Main.paint(l::Line1)::Float64
│    %14 = Main.f(%13)::Float64
│    (s = %12 + %14)
└─── goto #12
...
julia> @code_warntype paint2_annotated(p)
...
2 ── %7 = @_3::Tuple{LineAbstract, Int64}::Tuple{LineAbstract, Int64}
│    (l = Core.getfield(%7, 1))
│    %9 = Core.getfield(%7, 2)::Int64
│    %10 = (l isa Main.Line1)::Bool
└─── goto #4 if not %10
3 ── %12 = s::Float64
│    %13 = Core.typeassert(l::Line1, Main.Line1)::Line1
│    %14 = Main.paint(%13)::Float64
│    %15 = Main.f(%14)::Float64
│    (s = %12 + %15)
└─── goto #12
...
As one can see, the only difference is %13 = Core.typeassert(l::Line1, Main.Line1)::Line1, which is removed at later stages of compilation.
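For reference, here is a minimal sketch of the two variants being compared. The exact paint2 / paint2_annotated definitions come from earlier in the thread; this reconstruction is only an assumption, kept short to make the point readable on its own:

# Assumed reconstruction, not the exact code from the thread: the two versions
# differ only in whether the element is re-annotated with its concrete type
# inside the branch that already established that type.
function paint2(p)
    s = 0.0
    for l in p.lines
        if l isa Line1
            s += f(paint(l))          # the compiler already knows l::Line1 here
        elseif l isa Line2
            s += f(paint(l))
        end
        # ... one branch per subtype
    end
    s
end

function paint2_annotated(p)
    s = 0.0
    for l in p.lines
        if l isa Line1
            s += f(paint(l::Line1))   # explicit annotation only adds a typeassert
        elseif l isa Line2
            s += f(paint(l::Line2))
        end
        # ... one branch per subtype
    end
    s
end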
Secondly, the reason why functors do not work here is more subtle. On their own, the functors in this example act as function barriers and should have made the calculation type stable. The problem is that when the number of kernel functions gets too big, the compiler gives up. One can see this by comparing the @code_warntype output for 2 and 5 subtypes (a sketch of the functor setup being discussed is shown right below).
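This is an assumed sketch of the functor approach, not the thread's exact code: each subtype is made callable, so the kernel is selected by calling the element itself.

# Hypothetical functor definitions: each Line subtype gets a call method that
# returns whatever paint would have returned for it.
(l::Line1)() = paint(l)
(l::Line2)() = paint(l)
# ... and so on for the remaining subtypes

function paint3(p)
    s = 0.0
    for l in p.lines
        s += f(l())      # the call target depends on the concrete type of l
    end
    s
end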
With 2 subtypes:
julia> @code_warntype paint3(p)
Variables
  #self#::Core.Const(paint3)
  p::Picture{LineAbstract}
  @_3::Union{Nothing, Tuple{LineAbstract, Int64}}
  s::Float64
  l::LineAbstract
Body::Float64
1 ─ (s = 0.0)
│   %2 = Base.getproperty(p, :lines)::Vector{LineAbstract}
│   (@_3 = Base.iterate(%2))
│   %4 = (@_3 === nothing)::Bool
│   %5 = Base.not_int(%4)::Bool
└── goto #4 if not %5
2 ─ %7 = @_3::Tuple{LineAbstract, Int64}::Tuple{LineAbstract, Int64}
│   (l = Core.getfield(%7, 1))
│   %9 = Core.getfield(%7, 2)::Int64
│   %10 = s::Float64
│   %11 = (l)()::Float64
│   %12 = Main.f(%11)::Float64
│   (s = %10 + %12)
│   (@_3 = Base.iterate(%2, %9))
│   %15 = (@_3 === nothing)::Bool
│   %16 = Base.not_int(%15)::Bool
└── goto #4 if not %16
3 ─ goto #2
4 ─ return s
With 5 subtypes:
julia> @code_warntype paint3(p)
Variables
  #self#::Core.Const(paint3)
  p::Picture{LineAbstract}
  @_3::Union{Nothing, Tuple{LineAbstract, Int64}}
  s::Float64
  l::LineAbstract
Body::Float64
1 ─ (s = 0.0)
│   %2 = Base.getproperty(p, :lines)::Vector{LineAbstract}
│   (@_3 = Base.iterate(%2))
│   %4 = (@_3 === nothing)::Bool
│   %5 = Base.not_int(%4)::Bool
└── goto #4 if not %5
2 ─ %7 = @_3::Tuple{LineAbstract, Int64}::Tuple{LineAbstract, Int64}
│   (l = Core.getfield(%7, 1))
│   %9 = Core.getfield(%7, 2)::Int64
│   %10 = s::Float64
│   %11 = (l)()::Any
│   %12 = Main.f(%11)::Float64
│   (s = %10 + %12)
│   (@_3 = Base.iterate(%2, %9))
│   %15 = (@_3 === nothing)::Bool
│   %16 = Base.not_int(%15)::Bool
└── goto #4 if not %16
3 ─ goto #2
4 ─ return s
As one can see, the two outputs are almost identical, except that in the case of 5 subtypes the compiler infers the return type of the functor call as Any. This means that the compiler gives up on determining which method to call and falls back to dynamic dispatch. I may be wrong, but for a small number of subtypes the compiler seems to use an approach similar to union splitting (or perhaps exactly union splitting), which is why there was almost no difference between the union splitting and functor approaches. It is not always applied, since in the general case it would lead to exponentially growing code.
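A small standalone way to observe this effect (the Shape / area names below are assumptions for illustration, not part of the original benchmark):

# Hypothetical reproduction: an abstract type with several subtypes and one
# kernel method per subtype, called through an abstractly typed container.
abstract type Shape end
struct S1 <: Shape end
struct S2 <: Shape end
struct S3 <: Shape end
struct S4 <: Shape end
struct S5 <: Shape end

area(::S1) = 1.0
area(::S2) = 2.0

total(xs) = sum(area, xs)

# With only two matching methods, the call site should still infer as Float64:
Base.return_types(total, (Vector{Shape},))

area(::S3) = 3.0
area(::S4) = 4.0
area(::S5) = 5.0

# Once too many methods match the abstract signature, inference is expected to
# widen the result to Any, which is exactly the (l)()::Any seen above.
Base.return_types(total, (Vector{Shape},))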
As a last point, I'll repeat one more time that this problem is not Julia-specific. Better data organization can provide more benefit than any fancy language technique. Articles that deserve to be read are this awesome post on BioJulia and the data locality chapter in the game programming book. The latter is especially interesting because it is written in C++ and heavily uses an OOP approach, yet in the end the author reaches the same conclusion: memory layout matters, because the CPU is fast and memory is slow. Nothing good can come of beating your cache to a pulp.
Consider the following code:
struct Picture2
    objects::Vector    # one homogeneous vector per concrete Line type
end

p2 = let p = p
    # group the lines of the original picture by their concrete type
    p2 = [Line1[], Line2[], Line3[], Line4[], Line5[]]
    for l in p.lines
        if l isa Line1
            push!(p2[1], l)
        elseif l isa Line2
            push!(p2[2], l)
        elseif l isa Line3
            push!(p2[3], l)
        elseif l isa Line4
            push!(p2[4], l)
        elseif l isa Line5
            push!(p2[5], l)
        end
    end
    Picture2(p2)
end

function paint(p::Picture2)
    s = 0.
    for l in p.objects
        # each l is a concrete Vector{LineN}, so the inner sum is type stable
        s += sum(x -> f(paint(x)), l)
    end
    s
end
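The key property of this layout is that the inner call acts as a function barrier: dynamic dispatch happens once per homogeneous group rather than once per line. A rough hand-written equivalent of what the sum call gives us for free (paint_group / paint_barrier are hypothetical names; f and paint are the ones from the benchmark):

# paint_group is compiled separately for each concrete Vector{LineN},
# so the loop hidden inside sum is fully type stable.
paint_group(lines::Vector{T}) where {T} = sum(x -> f(paint(x)), lines)

function paint_barrier(p::Picture2)
    s = 0.0
    for l in p.objects        # l is only known abstractly here...
        s += paint_group(l)   # ...but one dynamic dispatch per group fixes that
    end
    s
end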
And the timings:
julia> print(" with runtime dispatch: "); @btime paint($p)
with runtime dispatch: 53.530 ms (1000000 allocations: 15.26 MiB)
julia> print(" with splitting: "); @btime paint2($p)
with splitting: 16.914 ms (0 allocations: 0 bytes)
julia> print(" simple sum of an array of Float64 of same size: "); @btime naivesum($x)
simple sum of an array of Float64 of same size: 7.131 ms (0 allocations: 0 bytes)
julia> print(" with proper data structure: "); @btime paint($p2)
with proper data structure: 7.240 ms (10 allocations: 160 bytes)
Despite some small allocations, we are twice as fast as union splitting and almost as fast as the direct naive calculation.

This is the most important point of the story. When you put irregular data into a structure and then try to traverse it sequentially, you are working against the computer and, as a result, slowing down your calculations. One should avoid this if possible, or accept the consequences.