Union splitting vs C++

lmiq · May 25, 2021, 12:21pm

This is a continuation from this thread: Performance drawback with subtyping - #31 by paulmelis

Here we have two codes, which compute someting simple (the sum of the values of a field of a type in an array of mixed types of objects). Typical case where it is hard to be type-stable and avoid run-time dispatch.

With 2 different types in the list of types, union splitting kicks in in Julia and the code is fast. With more (in the present example 5 types), it does not. In the C++ code clearly it does kick in and the code is fast. Thus, the C++ compiler is doing a better job here.

There are many alternatives to these codes (like restructuring the data to avoid mixed containers, and that is by far the best option in terms of performance, or using a macro to automatically do the union splitting for any number of subtypes). But this is not the point.

The point is that clearly the most simple way to write the code is faster in C++ than in Julia, because the compiler is more aggressive in C++ (or at least this is what I understand is happening).

So, the question is: is it possible to tune some compiler option in Julia so that the compiler more aggressively optimizes this code? (specifically, the number of subtypes above which it quits trying to do union splitting?).

This is a recurrent subject here, and here we have a clear example of C++ being faster than Julia for the code written in the “most natural” way. (not saying that the C++ is more natural in general, it is painfully ugly ). But the structure of the code is the same.

These are the benchmarks:

C++:

g++ -o disp -O3 -march=native -W -Wall  disp.cpp
./disp
n = 1000000 
dynamic dispatch : 0.008514 us per iteration (8.514000 ms total) [500006.610053]

Julia (1.6.1):

julia disp.jl 
result = 499496.79878554505
 with runtime dispatch:   59.806 ms (2000000 allocations: 30.52 MiB)
 with manual splitting:   8.133 ms (0 allocations: 0 bytes)

And the codes:

C++

// g++ -o disp -O3 -march=native -W -Wall  disp.cpp
#include <sys/time.h>
#include <cstdio>
#include <stdlib.h>
#include <vector>

struct LineAbstract
{
    virtual double paint() =0;
};

struct LineA : public LineAbstract
{     
    LineA(double l) { length = l; }
    virtual double paint() { return length; }
    double length;
};

struct LineB : public LineAbstract
{        
    LineB(double l) { length = l; }
    virtual double paint() { return length; }
    double length;
};

struct LineC : public LineAbstract
{        
    LineC(double l) { length = l; }
    virtual double paint() { return length; }
    double length;
};

struct LineD : public LineAbstract
{        
    LineD(double l) { length = l; }
    virtual double paint() { return length; }
    double length;
};

struct LineE : public LineAbstract
{        
    LineE(double l) { length = l; }
    virtual double paint() { return length; }
    double length;
};

struct Picture
{
    Picture(std::vector<LineAbstract*> l) { lines = l; }
    std::vector<LineAbstract*> lines;
};

// Dynamic dispatch at runtime 
double paint(Picture& p)
{
    double s = 0.0;
    
    for (auto l : p.lines)
        s += l->paint();
    
    return s;
}

double tdiff(struct timeval t0, struct timeval t1)
{
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec)/1000000.0;
}

int main()
{
    int n = 1000000;
    printf("n = %i \n",n);
    
    std::vector<LineAbstract*> lines;
    for (int i = 0; i < n; i++)
    {        
        double r = ((double) rand() / (RAND_MAX));
        if (r <= 0.2)
          lines.push_back((LineAbstract*)new LineA(r));
        else if (r <= 0.4)
          lines.push_back((LineAbstract*)new LineB(r)); 
        else if (r <= 0.6)
          lines.push_back((LineAbstract*)new LineC(r)); 
        else if (r <= 0.8)
          lines.push_back((LineAbstract*)new LineD(r)); 
        else if (r <= 1.0)
          lines.push_back((LineAbstract*)new LineE(r)); 
    }
    
    Picture p(lines);

    struct timeval t0, t1;
    double t, res;
    
    // Store resulting sum, to make sure the calls aren't erased
    gettimeofday(&t0, NULL);
    res = paint(p);
    gettimeofday(&t1, NULL);
    t = tdiff(t0,t1);
    printf("dynamic dispatch : %.6f us per iteration (%.6f ms total) [%.6f]\n", t/n*1000000, t*1000, res);
    
}

Julia

module Lines

using BenchmarkTools
using Test

abstract type LineAbstract end

struct Line1 <: LineAbstract length::Float64 end
struct Line2 <: LineAbstract length::Float64 end
struct Line3 <: LineAbstract length::Float64 end
struct Line4 <: LineAbstract length::Float64 end
struct Line5 <: LineAbstract length::Float64 end

struct Picture{T<:LineAbstract}
       lines::Vector{T}
end 

paint(l :: Line1) = l.length
paint(l :: Line2) = l.length
paint(l :: Line3) = l.length
paint(l :: Line4) = l.length
paint(l :: Line5) = l.length

# Dynamical dispatch at runtime

function paint1(p)
  s = 0.
  for l in p.lines
    s += paint(l)
  end
  s
end

# Union splitting

function paint2(p)
  s = 0.
  for l in p.lines
    if l isa Line1
      s += paint(l)
    elseif l isa Line2
      s += paint(l)
    elseif l isa Line3
      s += paint(l)
    elseif l isa Line4
      s += paint(l)
    elseif l isa Line5
      s += paint(l)
    end
  end
  s
end

function run(n)

  line_types = [ Line1, Line2, Line3, Line4, Line5 ]
  p = Picture([line_types[rand(1:5)](rand()) for i in 1:n])

  @test paint1(p) ≈ paint2(p)
  println("result = ", paint1(p))

  print(" with runtime dispatch: "); @btime paint1($p) 
  print(" with splitting: "); @btime paint2($p) 
end

end

# running
Lines.run(1_000_000)

edit: updated the julia code to make it easier to test things with Revise.

paulmelis · May 25, 2021, 12:42pm

Beauty is in the eye of the beholder

stevengj · May 25, 2021, 12:47pm

My understanding is that dynamic dispatch is a lot cheaper in C++ than in Julia because OOP-style single dispatch can use vtables.

(It’s interesting to think about how one might go about writing a package that implements vtables for dynamic single dispatch in Julia.)

rdeits · May 25, 2021, 1:07pm

I played around with a very simple version of this in https://github.com/rdeits/ConcreteInterfaces.jl/blob/master/demo.ipynb

The only problem is that FunctionWrappers has had a long series of performance and stability problems in different Julia versions, e.g. cfunction issue with FunctionWrappers · Issue #28669 · JuliaLang/julia · GitHub Inline allocated non-bits type fails on 1.1.2 · Issue #23 · yuyichao/FunctionWrappers.jl · GitHub which appear to only be fixable in Julia base itself.

If there was one thing I could ask for in Julia, it might actually be a reliably stable and performant function wrapper. Lack of that has basically stalled progress over at GitHub - tkoolen/Parametron.jl: Efficiently solving instances of a parameterized family of (possibly mixed-integer) linear/quadratic optimization problems in Julia too.

kristoffer.carlsson · May 25, 2021, 1:10pm

I don’t think this is true. Julia does not do union splitting on abstract types because that would require invalidating code when adding a new subtype to the abstract type. What matters here is the number of paint method defined. If there are only two methods defined for a function, Julia will look at the result of those when looking at the function applied to an abstract type, otherwise I think it just gives Any immidately.

For example, in

abstract type LineAbstract end

struct Line1 <: LineAbstract length::Float64 end
struct Line2 <: LineAbstract length::Float64 end

struct Picture
   lines::Vector{LineAbstract}
end

paint(l :: Line1) = l.length
paint(l :: Line2) = l.length
# paint(l::String) = "foo"


# Dynamical dispatch at runtime

function paint(p)
  s = 0.
  for l in p.lines
    s += paint(l)
  end
  s
end

code_warntype(paint, Tuple{Picture})

Uncommenting the paint(l::String) = "foo" line will make s in paint no longer infer to Float64.

FPGro · May 25, 2021, 1:10pm

There is an inference parameter MAX_UNION_SPLITTING, although I must admit that I never tried to change it or know if this is even possible in a running session.

github.com

JuliaLang/julia/blob/bb5b98e72a151c41471d8cc14cacb495d647fb7f/base/compiler/types.jl#L103


      
          
          # don't consider more than N methods. this trades off between
          # compiler performance and generated code performance.
          # typically, considering many methods means spending lots of time
          # obtaining poor type information.
          # It is important for N to be >= the number of methods in the error()
          # function, so we can still know that error() is always Bottom.
          MAX_METHODS::Int
          # the maximum number of union-tuples to swap / expand
          # before computing the set of matching methods
          MAX_UNION_SPLITTING::Int
          # the maximum number of union-tuples to swap / expand
          # when inferring a call to _apply_iterate
          MAX_APPLY_UNION_ENUM::Int
          
          # parameters limiting large (tuple) types
          TUPLE_COMPLEXITY_LIMIT_DEPTH::Int
          
          # when attempting to inline _apply_iterate, abort the optimization if the
          # tuple contains more than this many elements
          MAX_TUPLE_SPLAT::Int

Sukera · May 25, 2021, 1:20pm

That seems rather harsh to me - intuitively, I’d expect to at least give a Union of return types if the number of methods is too large, since (naively) I’d expect multiple methods to at least have similar return types. I may very well be wrong though.

lmiq · May 25, 2021, 1:21pm

Interesting. If one changes that part to:

function paint1(p)
  s = 0.
  for l in p.lines
    s += l.length
  end
  s
end

Thus completely removing the calls to paint above and disambiguating the name of the function that does the sum to paint1, we get a slightly faster code, but still:

julia disp.jl
result = 500327.3302118598
 with runtime dispatch:   38.532 ms (2000000 allocations: 30.52 MiB)
 with manual splitting:   9.108 ms (0 allocations: 0 bytes)

The inference of the output could be improved by annotating the output type there:

function paint1(p)
  s = 0.
  for l in p.lines
    s += paint(l)::Float64
  end
  s
end

and that improves allocations, but the time is not very good:

julia disp.jl
result = 500785.78210850916
 with runtime dispatch:   46.824 ms (1000000 allocations: 15.26 MiB)
 with manual splitting:   8.463 ms (0 allocations: 0 bytes)

lmiq · May 25, 2021, 1:44pm

syntactically one appears to be able to change those parameters with:

# running
Core.Compiler.InferenceParams(union_splitting=5,max_methods=5)

but I cannot see any effect, so probably this is not working.

Elrod · May 25, 2021, 2:38pm

struct Picture2{T<:Union{Line1,Line2,Line3,Line4,Line5}}
    lines::Vector{T}
end
function paint3(p)
    s = 0.
    @inbounds for i in eachindex(p.lines)
        s += paint(p.lines[i])
    end
    s
end

function run(n)

  line_types = [ Line1, Line2, Line3, Line4, Line5 ]
  p = Picture([line_types[rand(1:5)](rand()) for i in 1:n])
  p2 = Picture2(convert(Vector{Union{Line1,Line2,Line3,Line4,Line5}}, p.lines))
  @test paint1(p) ≈ paint2(p) ≈ paint3(p2)
  println("result = ", paint1(p))

  print(" with runtime dispatch: "); @btime paint1($p) 
  print(" with splitting: "); @btime paint2($p) 
  print(" with compiler splitting: "); @btime paint3($p2) 
end

Results:

julia> run(10)
result = 5.4283370344136666
 with runtime dispatch:   357.996 ns (20 allocations: 320 bytes)
 with splitting:   15.889 ns (0 allocations: 0 bytes)
 with compiler splitting:   4.765 ns (0 allocations: 0 bytes)
5.4283370344136666

julia> run(10000)
result = 5009.613041629238
 with runtime dispatch:   597.098 μs (20000 allocations: 312.50 KiB)
 with splitting:   71.937 μs (0 allocations: 0 bytes)
 with compiler splitting:   11.153 μs (0 allocations: 0 bytes)
5009.613041629238

For some reason it won’t SIMD, but otherwise it’s been optimized:

julia> typeof(p2)
Picture2{Union{Line1, Line2, Line3, Line4, Line5}}

julia> @code_llvm debuginfo=:none paint3(p2)

define double @julia_paint3_1944([1 x {} addrspace(10)*] addrspace(11)* nocapture nonnull readonly align 8 dereferenceable(8) %0) #0 !dbg !5 {
top:
  %1 = bitcast [1 x {} addrspace(10)*] addrspace(11)* %0 to {} addrspace(10)* addrspace(10)* addrspace(11)*, !dbg !7
  %2 = load atomic {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)* addrspace(11)* %1 unordered, align 8, !dbg !7, !tbaa !14, !nonnull !4, !dereferenceable !18, !align !19
  %3 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(10)* %2, i64 3, !dbg !20
  %4 = bitcast {} addrspace(10)* addrspace(10)* %3 to i64 addrspace(10)*, !dbg !20
  %5 = addrspacecast i64 addrspace(10)* %4 to i64 addrspace(11)*, !dbg !20
  %6 = load i64, i64 addrspace(11)* %5, align 8, !dbg !20, !tbaa !30, !range !33
  %.not = icmp eq i64 %6, 0, !dbg !34
  br i1 %.not, label %L60, label %L14.lr.ph, !dbg !37

L14.lr.ph:                                        ; preds = %top
  %7 = bitcast {} addrspace(10)* addrspace(10)* %2 to [1 x i64] addrspace(13)* addrspace(10)*, !dbg !38
  %8 = addrspacecast [1 x i64] addrspace(13)* addrspace(10)* %7 to [1 x i64] addrspace(13)* addrspace(11)*, !dbg !38
  %9 = load [1 x i64] addrspace(13)*, [1 x i64] addrspace(13)* addrspace(11)* %8, align 8, !dbg !38, !tbaa !43, !nonnull !4
  br label %L14, !dbg !45

L14:                                              ; preds = %L14, %L14.lr.ph
  %value_phi120 = phi i64 [ 0, %L14.lr.ph ], [ %14, %L14 ]
  %value_phi19 = phi double [ 0.000000e+00, %L14.lr.ph ], [ %13, %L14 ]
  %10 = getelementptr inbounds [1 x i64], [1 x i64] addrspace(13)* %9, i64 %value_phi120, i64 0, !dbg !46
  %11 = bitcast i64 addrspace(13)* %10 to double addrspace(13)*, !dbg !46
  %12 = load double, double addrspace(13)* %11, align 8, !dbg !46, !tbaa !47
  %13 = fadd fast double %12, %value_phi19, !dbg !50
  %14 = add nuw nsw i64 %value_phi120, 1, !dbg !53
  %exitcond.not = icmp eq i64 %14, %6, !dbg !56
  br i1 %exitcond.not, label %L60, label %L14, !dbg !45, !llvm.loop !57

L60:                                              ; preds = %L14, %top
  %value_phi4 = phi double [ 0.000000e+00, %top ], [ %13, %L14 ]
  ret double %value_phi4, !dbg !58
}

Probably a pass ordering problem, because llvm opt has no problem SIMD-ing the above. So I assume that the code was still too messy when the autovectorizer ran, and only got cleaned up later.
It seems to take two runs through opt, the default optimization passes it runs aren’t enough when given the @code_llvm optimize=false output instead of the @code_llvm.
EDIT: Actually, all it takes for LLVM to vectorize the unoptimized Julia IR is deleting the call to @julia.loopinfo_marker(). Maybe this interferes with Julia’s vectorization as well?

Skoffer · May 25, 2021, 2:51pm

Are there any limitations to the number of types which is used in Union for this implementation to be effective? I mean, is there any threshold, like more than some X and suddenly it stops working.

I remember vaguely from the manual, that too many types in Union is ineffective.

lmiq · May 25, 2021, 3:04pm

That is interesting. But the result is a combination of the Union type (which I have tried before) and the way you wrote the loop. With the original loop the result is very slow:

result = 500186.5566246887
 with runtime dispatch:   58.996 ms (2000000 allocations: 30.52 MiB)
 with splitting:   9.357 ms (0 allocations: 0 bytes)
 with compiler splitting:   188.885 ms (6998980 allocations: 122.05 MiB)
 with runtime dispatch 3:   59.901 ms (2000000 allocations: 30.52 MiB)
 with compiler splitting 3:   1.027 ms (0 allocations: 0 bytes)

Updated code

module Lines

export test
export Line1, Line2, Line3, Line4, Line5
export LineUnion, Picture, Picture2

using BenchmarkTools
using Test

abstract type LineAbstract end

struct Line1 <: LineAbstract length::Float64 end
struct Line2 <: LineAbstract length::Float64 end
struct Line3 <: LineAbstract length::Float64 end
struct Line4 <: LineAbstract length::Float64 end
struct Line5 <: LineAbstract length::Float64 end

paint(l :: Line1) = l.length
paint(l :: Line2) = l.length
paint(l :: Line3) = l.length
paint(l :: Line4) = l.length
paint(l :: Line5) = l.length

struct Picture{T<:LineAbstract}
       lines::Vector{T}
end 

const LineUnion = Union{Line1,Line2,Line3,Line4,Line5}
struct Picture2{T<:LineUnion}
       lines::Vector{T}
end 

# Dynamical dispatch at runtime

function paint1(p)
  s = 0.
  for l in p.lines
    s += paint(l)
  end
  s
end

function paint3(p)
  s = 0.
  @inbounds for i in eachindex(p.lines)
    s += paint(p.lines[i])
  end
  s
end

# Union splitting

function paint2(p)
  s = 0.
  for l in p.lines
    if l isa Line1
      s += paint(l)
    elseif l isa Line2
      s += paint(l)
    elseif l isa Line3
      s += paint(l)
    elseif l isa Line4
      s += paint(l)
    elseif l isa Line5
      s += paint(l)
    end
  end
  s
end

function test(n)

  line_types = [ Line1, Line2, Line3, Line4, Line5 ]
  p = Picture([line_types[rand(1:5)](rand()) for i in 1:n])
  p2 = Picture2(convert(Vector{LineUnion},p.lines))

  @test paint1(p) ≈ paint2(p)
  println("result = ", paint1(p))

  print(" with runtime dispatch: "); @btime paint1($p) 
  print(" with splitting: "); @btime paint2($p) 
  print(" with compiler splitting: "); @btime paint1($p2) 
  print(" with runtime dispatch 3: "); @btime paint3($p) 
  print(" with compiler splitting 3: "); @btime paint3($p2) 

end

end

using .Lines
test(1_000_000)

anon56330260 · May 25, 2021, 3:23pm

Vtable alone is not enough. You need to use function poiners and object pointers together to get around type inference. Virtual function is simply a function taking an object pointer as the first parameter (that is self). So we need to do something like this:

f1(self::Ptr{Void},args...) = ...
f2(self::Ptr{Void},args...) = ...
#then use FunctionWrapper to wrap f1 and f2
mutable struct S
    methodtable::MethodList{FunctionPointer}
    #add custom function table here, MethodList{FunctionPointer} is a pointer array
    #you can also store class property 
    #other field
end

This is basically how C++ people implement their vtables. Ptr{Void} is a must to eliminate different types. So you need to use mutable types to take the pointer, then cast it to Ptr{Void}. You can also use immutable types if they are wrapped in Base.RefValue(like shared_ptr in C++). And you now don’t have arrays with abstract types, you have array with Ptr{Void}. To call a specific method, we need to calculate the corresponding offset in method table (in compile time), and invoke that function with the pointer.
This approach actually works in current Julia(I have tried a small prototype in a raytracer project), but programmer needs to create and maintain method table manually. Also, since raw pointer is not tracked by GC, sometimes we need to ensure the lifetime of the object is long enough by using GC.@preserve.
So the core problem here is that, to use vtable programmer needs to use pointer explicitly everywhere. In C++, self is an implicit parameter and handled by compiler. Programmer doesn’t need to write out its signature Ptr{Void}. Also, pointer casting is needed from Ptr{Void} to the concrete types when you use the pointer.This type of thing cannot be abstracted away by designing a vtable library carefully.

lmiq · May 25, 2021, 6:08pm

Up to 20 the results are roughly the same:

result = 499726.0713257201
 with runtime dispatch:   137.170 ms (2000000 allocations: 30.52 MiB)
 with splitting:   15.002 ms (0 allocations: 0 bytes)
 with compiler splitting:   366.458 ms (6998980 allocations: 122.05 MiB)
 with runtime dispatch 3:   157.443 ms (2000000 allocations: 30.52 MiB)
 with compiler splitting 3:   1.437 ms (0 allocations: 0 bytes)

(@Elrod solution seems too good to be true, I have the feeling of being tricked by the compiler here)

Code

module Lines

export test
export Line1, Line2, Line3, Line4, Line5, Line6, Line7, Line8, Line9, Line10
       Line11, Line12, Line13, Line14, Line15, Line16, Line17, Line18, Line19, Line20
export LineUnion, Picture, Picture2

using BenchmarkTools
using Test

abstract type LineAbstract end

struct Line1 <: LineAbstract length::Float64 end
struct Line2 <: LineAbstract length::Float64 end
struct Line3 <: LineAbstract length::Float64 end
struct Line4 <: LineAbstract length::Float64 end
struct Line5 <: LineAbstract length::Float64 end
struct Line6 <: LineAbstract length::Float64 end
struct Line7 <: LineAbstract length::Float64 end
struct Line8 <: LineAbstract length::Float64 end
struct Line9 <: LineAbstract length::Float64 end
struct Line10 <: LineAbstract length::Float64 end
struct Line11 <: LineAbstract length::Float64 end
struct Line12 <: LineAbstract length::Float64 end
struct Line13 <: LineAbstract length::Float64 end
struct Line14 <: LineAbstract length::Float64 end
struct Line15 <: LineAbstract length::Float64 end
struct Line16 <: LineAbstract length::Float64 end
struct Line17 <: LineAbstract length::Float64 end
struct Line18 <: LineAbstract length::Float64 end
struct Line19 <: LineAbstract length::Float64 end
struct Line20 <: LineAbstract length::Float64 end

paint(l :: Line1) = l.length
paint(l :: Line2) = l.length
paint(l :: Line3) = l.length
paint(l :: Line4) = l.length
paint(l :: Line5) = l.length
paint(l :: Line6) = l.length
paint(l :: Line7) = l.length
paint(l :: Line8) = l.length
paint(l :: Line9) = l.length
paint(l :: Line10) = l.length
paint(l :: Line11) = l.length
paint(l :: Line12) = l.length
paint(l :: Line13) = l.length
paint(l :: Line14) = l.length
paint(l :: Line15) = l.length
paint(l :: Line16) = l.length
paint(l :: Line17) = l.length
paint(l :: Line18) = l.length
paint(l :: Line19) = l.length
paint(l :: Line20) = l.length

struct Picture{T<:LineAbstract}
       lines::Vector{T}
end 

const LineUnion = Union{Line1,Line2,Line3,Line4,Line5,
                        Line6,Line7,Line8,Line9,Line10,
                        Line11,Line12,Line13,Line14,Line15,
                        Line16,Line17,Line18,Line19,Line20}
struct Picture2{T<:LineUnion}
       lines::Vector{T}
end 

# Dynamical dispatch at runtime

function paint1(p)
  s = 0.
  for l in p.lines
    s += paint(l)
  end
  s
end

function paint3(p)
  s = 0.
  @inbounds for i in eachindex(p.lines)
    s += paint(p.lines[i])
  end
  s
end

# Union splitting

function paint2(p)
  s = 0.
  for l in p.lines
    if l isa Line1 s += paint(l)
    elseif l isa Line2 s += paint(l)
    elseif l isa Line3 s += paint(l)
    elseif l isa Line4 s += paint(l)
    elseif l isa Line5 s += paint(l)
    elseif l isa Line6 s += paint(l)
    elseif l isa Line7 s += paint(l)
    elseif l isa Line8 s += paint(l)
    elseif l isa Line9 s += paint(l)
    elseif l isa Line10 s += paint(l)
    elseif l isa Line11 s += paint(l)
    elseif l isa Line12 s += paint(l)
    elseif l isa Line13 s += paint(l)
    elseif l isa Line14 s += paint(l)
    elseif l isa Line15 s += paint(l)
    elseif l isa Line16 s += paint(l)
    elseif l isa Line17 s += paint(l)
    elseif l isa Line18 s += paint(l)
    elseif l isa Line19 s += paint(l)
    elseif l isa Line20 s += paint(l)
    end
  end
  s
end

function test(n)

  line_types = [ Line1, Line2, Line3, Line4, Line5, 
                 Line6, Line7, Line8, Line9, Line10, 
                 Line11, Line12, Line13, Line14, Line15, 
                 Line16, Line17, Line18, Line19, Line20 ]
  p = Picture([line_types[rand(1:20)](rand()) for i in 1:n])
  p2 = Picture2(convert(Vector{LineUnion},p.lines))

  @test paint1(p) ≈ paint2(p)
  println("result = ", paint1(p))

  print(" with runtime dispatch: "); @btime paint1($p) 
  print(" with splitting: "); @btime paint2($p) 
  print(" with compiler splitting: "); @btime paint1($p2) 
  print(" with runtime dispatch 3: "); @btime paint3($p) 
  print(" with compiler splitting 3: "); @btime paint3($p2) 

end

end

using .Lines
test(1_000_000)

I’ve added the computation of a generic function (here a sin) on the elements, to be sure I was not being tricked, and I am not:

result = 4600.886933656946
 with runtime dispatch:   1.091 ms (20000 allocations: 312.50 KiB)
 with splitting:   197.816 μs (0 allocations: 0 bytes)
 with compiler splitting:   2.801 ms (68980 allocations: 1.21 MiB)
 with runtime dispatch 3:   1.196 ms (20000 allocations: 312.50 KiB)
 with compiler splitting 3:   169.995 μs (0 allocations: 0 bytes)

Code

module Lines

export test
export Line1, Line2, Line3, Line4, Line5, Line6, Line7, Line8, Line9, Line10
       Line11, Line12, Line13, Line14, Line15, Line16, Line17, Line18, Line19, Line20
export LineUnion, Picture, Picture2

using BenchmarkTools
using Test

abstract type LineAbstract end

struct Line1 <: LineAbstract length::Float64 end
struct Line2 <: LineAbstract length::Float64 end
struct Line3 <: LineAbstract length::Float64 end
struct Line4 <: LineAbstract length::Float64 end
struct Line5 <: LineAbstract length::Float64 end
struct Line6 <: LineAbstract length::Float64 end
struct Line7 <: LineAbstract length::Float64 end
struct Line8 <: LineAbstract length::Float64 end
struct Line9 <: LineAbstract length::Float64 end
struct Line10 <: LineAbstract length::Float64 end
struct Line11 <: LineAbstract length::Float64 end
struct Line12 <: LineAbstract length::Float64 end
struct Line13 <: LineAbstract length::Float64 end
struct Line14 <: LineAbstract length::Float64 end
struct Line15 <: LineAbstract length::Float64 end
struct Line16 <: LineAbstract length::Float64 end
struct Line17 <: LineAbstract length::Float64 end
struct Line18 <: LineAbstract length::Float64 end
struct Line19 <: LineAbstract length::Float64 end
struct Line20 <: LineAbstract length::Float64 end

paint(l :: Line1) = l.length
paint(l :: Line2) = l.length
paint(l :: Line3) = l.length
paint(l :: Line4) = l.length
paint(l :: Line5) = l.length
paint(l :: Line6) = l.length
paint(l :: Line7) = l.length
paint(l :: Line8) = l.length
paint(l :: Line9) = l.length
paint(l :: Line10) = l.length
paint(l :: Line11) = l.length
paint(l :: Line12) = l.length
paint(l :: Line13) = l.length
paint(l :: Line14) = l.length
paint(l :: Line15) = l.length
paint(l :: Line16) = l.length
paint(l :: Line17) = l.length
paint(l :: Line18) = l.length
paint(l :: Line19) = l.length
paint(l :: Line20) = l.length

f(l) = sin(paint(l))

struct Picture{T<:LineAbstract}
       lines::Vector{T}
end 

const LineUnion = Union{Line1,Line2,Line3,Line4,Line5,
                        Line6,Line7,Line8,Line9,Line10,
                        Line11,Line12,Line13,Line14,Line15,
                        Line16,Line17,Line18,Line19,Line20}
struct Picture2{T<:LineUnion}
       lines::Vector{T}
end 

# Dynamical dispatch at runtime

function paint1(p,f)
  s = 0.
  for l in p.lines
    s += f(l)
  end
  s
end

function paint3(p,f)
  s = 0.
  @inbounds for i in eachindex(p.lines)
    s += f(p.lines[i])
  end
  s
end

# Union splitting

function paint2(p,f)
  s = 0.
  for l in p.lines
    if l isa Line1 s += f(l)
    elseif l isa Line2 s += f(l)
    elseif l isa Line3 s += f(l)
    elseif l isa Line4 s += f(l)
    elseif l isa Line5 s += f(l)
    elseif l isa Line6 s += f(l)
    elseif l isa Line7 s += f(l)
    elseif l isa Line8 s += f(l)
    elseif l isa Line9 s += f(l)
    elseif l isa Line10 s += f(l)
    elseif l isa Line11 s += f(l)
    elseif l isa Line12 s += f(l)
    elseif l isa Line13 s += f(l)
    elseif l isa Line14 s += f(l)
    elseif l isa Line15 s += f(l)
    elseif l isa Line16 s += f(l)
    elseif l isa Line17 s += f(l)
    elseif l isa Line18 s += f(l)
    elseif l isa Line19 s += f(l)
    elseif l isa Line20 s += f(l)
    end
  end
  s
end

function test(n)

  line_types = [ Line1, Line2, Line3, Line4, Line5, 
                 Line6, Line7, Line8, Line9, Line10, 
                 Line11, Line12, Line13, Line14, Line15, 
                 Line16, Line17, Line18, Line19, Line20 ]
  p = Picture([line_types[rand(1:20)](rand()) for i in 1:n])
  p2 = Picture2(convert(Vector{LineUnion},p.lines))

  @test paint1(p,f) ≈ paint2(p,f) ≈ paint3(p,f)
  println("result = ", paint1(p,f))

  print(" with runtime dispatch: "); @btime paint1($p,$f) 
  print(" with splitting: "); @btime paint2($p,$f) 
  print(" with compiler splitting: "); @btime paint1($p2,$f) 
  print(" with runtime dispatch 3: "); @btime paint3($p,$f) 
  print(" with compiler splitting 3: "); @btime paint3($p2,$f) 

end

end

using .Lines
test(10_000)

jzr · May 25, 2021, 7:33pm

I’m not sure if I’ve understood this… It seems like a difference between @lmiq’s original code and @Elrod’s faster code is related to modularity. In the original code, more Line types can be added dynamically by the user and they’ll be supported by paint(p). I think Elrods doesn’t support that yet(?) – is there a modification that would support dynamically adding new line types to the union?

kristoffer.carlsson · May 25, 2021, 8:23pm

I think the limit was removed in RFC: inference: remove union-split limit for linear signatures by vtjnash · Pull Request #37378 · JuliaLang/julia · GitHub.

This permits users to explicitly demand larger union products (for example, with type-asserts or field types) and still expect to get reliable union-splitting at any size in single-dispatch sites.

lmiq · May 26, 2021, 12:23am

I am not sure how easy is to understand everything that is going on. The easiest part is that if one initializes the vector with

p = Picture([line_types[rand(1:5)](rand()) for i in 1:1000]);

the container is of completely abstract line types, of course:

julia> typeof(p.lines)
Vector{Main.Lines.LineAbstract} (alias for Array{Main.Lines.LineAbstract, 1})

and then yes, to that vector we can push! a new type of line:

julia> push!(p.lines,NewLine(1.0))
1001-element Vector{Main.Lines.LineAbstract}:
 Line5(0.46793882745951554)
⋮
 NewLine(1.0)

this is more flexible than doing

julia> p = Picture2(Union{Line1,Line2,Line3,Line4,Line5}[line_types[rand(1:5)](rand()) for i in 1:1000]);

julia> typeof(p.lines)
Vector{LineUnion} (alias for Array{Union{Line1, Line2, Line3, Line4, Line5}, 1})

julia> push!(p.lines,NewLine(1.0))
ERROR: MethodError: Cannot `convert` an object of type 
  NewLine to an object of type 
  LineUnion

Yet, that difference in flexibility is only associated to the manipulation of the array, not with the function paint. That will work either way, but will specialize differently for each type of array. One could expect (I did) that with the Union type the compiler could do some sort of automatic splitting and be reasonably fast.

Yet that is more complicated:

julia> p = Picture([line_types[rand(1:5)](rand()) for i in 1:1000]);

julia> @btime Lines.paint1($p)
  55.803 μs (2000 allocations: 31.25 KiB)
504.7398735079236

julia> p = Picture2(Union{Line1,Line2,Line3,Line4,Line5}[line_types[rand(1:5)](rand()) for i in 1:1000]);

julia> @btime Lines.paint1($p)
  179.357 μs (5980 allocations: 109.06 KiB)
509.5267566245098

I would not expect the result above, on the contrary. paint1 there iterates with for l in p.lines, and I hoped that with the Union type it would be faster.

That changes completely if one iterates with for i in eachindex(p) (the paint3 function suggested by Elrod):

julia> @btime Lines.paint3($p)
  57.717 μs (2000 allocations: 31.25 KiB)
500.16546958495144

julia> p = Picture2(Union{Line1,Line2,Line3,Line4,Line5}[line_types[rand(1:5)](rand()) for i in 1:1000]);

julia> @btime Lines.paint3($p)
  984.929 ns (0 allocations: 0 bytes)
493.92273981135605

The performance with the Union type is not only faster, as expected. Is much faster than expected (it much faster than the C++ alternative as well).

So there is something related to the implementation of these iterators and how they correlate with the type of array that is being input that is important for the result here. That escapes me, but I think that suggests that some iterators could be improved (in particular the non-indexed one).

I will call attention for this fact: if one iterates with pairs, like this:

function paint4(p)
  s = 0.
  for (i,l) in pairs(p.lines)
    s += paint(l)
  end
  s
end

we get:

julia> @btime Lines.paint4($p)
  994.917 ns (0 allocations: 0 bytes)
493.92273981135605

while with

function paint1(p)
  s = 0.
  for l in p.lines
    s += paint(l)
  end
  s
end

we get:

julia> @btime Lines.paint1($p)
  176.550 μs (5980 allocations: 109.06 KiB)
493.92273981135605

I would, at this point, probably call this a bug on the iterators of the form for x in v, I cannot see how it can behave so much worse than for (i,x) in pairs(v), which is more general than the former.

Edit: I reported this as an issue: very bad performance of iterator on array of union types · Issue #40950 · JuliaLang/julia · GitHub

Elrod · May 26, 2021, 12:36am

Inference gives up on the dynamic dispatches of iterate:

julia> i, s = iterate(p2.lines)
(Line3(0.874075411120405), 2)

julia> i, s = iterate(p2.lines, s)
(Line1(0.10729680676556685), 3)

julia> @code_warntype iterate(p2.lines, s)
MethodInstance for iterate(::Vector{Union{Line1, Line2, Line3, Line4, Line5}}, ::Int64)
  from iterate(A::Array, i) in Base at array.jl:809
Arguments
  #self#::Core.Const(iterate)
  A::Vector{Union{Line1, Line2, Line3, Line4, Line5}}
  i::Int64
Locals
  val::Tuple{Union{Line1, Line2, Line3, Line4, Line5}, Int64}
Body::Union{Nothing, Tuple{Any, Any}}
1 ─       nothing
│         Core.NewvarNode(:(val))
│   %3  = (i % Base.UInt)::UInt64
│   %4  = (%3 - 1)::UInt64
│   %5  = Base.length(A)::Int64
│   %6  = (%4 < %5)::Bool
└──       goto #3 if not %6
2 ─       $(Expr(:inbounds, true))
│   %9  = Base.getindex(A, i)::Union{Line1, Line2, Line3, Line4, Line5}
│   %10 = (i + 1)::Int64
│         (val = Core.tuple(%9, %10))
│         $(Expr(:inbounds, :pop))
└──       return val
3 ─       return Base.nothing

Ends up inferring Tuple{Any,Any}.

lmiq · May 26, 2021, 12:37am

But is there is important reason for this iteration for x in v be implemented in a way that cannot be a simplified version of for (i,x) in pairs(v)?

Elrod · May 26, 2021, 12:40am

I’m not sure why it gives up on constructing the tuple. I.e. %9 is a union and %10 is an Int64, yet Core.tuple(%9, %10)::Tuple{Any,Any}.

Topic		Replies	Views
Performance drawback with subtyping Performance	34	3204	August 26, 2021
Avoiding Vectors of Abstract Types Performance question , type-stability	22	4428	February 17, 2022
Update on single dispatch benchmark comparison to C++ Performance cxx , benchmark	19	3058	February 13, 2021
[ANN] ManualDispatch.jl Package Announcements dispatch	15	2794	February 7, 2022
Slow code with Union and Box General Usage	17	298	March 6, 2024

Union splitting vs C++

Related topics