Improving performance elegantly - type stability

Good afternoon,

I have a couple of questions about how to improve performance. I have read the Performance Tips in the documentation (a couple of times), but not everything is clear to me.
In fact, I became a little paranoid about “type stability”, to the point that I duplicated all my functions to cover Vector of Float / Vector of Complex / Array of Float / Array of Complex…

If this is indeed the right way to go, how could it be done in a more compact way? For instance, suppose I have the two following functions:

function fun_vec(a, M)
    c = 0.
    for i in 1:length(M)
        c += a[i]*M[i]
    end
    return c
end

function fun_matrix(a, M)
    c = zeros(Float64, size(a, 2))
    for i in 1:length(M)
        @. c += a[i, :]*M[i]
    end
    return c
end

Can I make it “type stable” with one function? (When I tried, @code_warntype highlighted the body in red.)

The same questions apply if I want to treat the case where the elements of a are Complex{Float64}, or Interval (I am not sure whether using eltype would be detrimental). The latter is especially important for me, since my functions must work with IntervalArithmetic.

I would say that the first question is: “Can you make it work (at all) with one function?” That is, is there a way to write generic code which, when called with arguments of different types, always computes what you want it to compute? In your case, since you broadcast some operations in the “matrix” case but not in the “vector” case, neither of your implementations works as intended for the other type (or maybe I did not understand what you intended to compute).

Once you have such a function, you can check whether it is type stable.
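For instance, with @code_warntype on concrete inputs (just a quick sketch; any concrete values will do, and InteractiveUtils is already loaded in the REPL):

using InteractiveUtils  # provides @code_warntype

a = rand(4); M = rand(4)
@code_warntype fun_vec(a, M)  # look for red Any/Union annotations;
                              # here the return type should infer as Float64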


I’m not exactly sure what you’re trying to achieve here, but perhaps this is not too far from what you want:

function fun(a, m)
    c = zero(a[1,:])
    for i in 1:length(m)
        @. c += a[i, :]*m[i]
    end
    return c
end

Vector of integers (note that this differs from your fun_vec implementation in that it returns a 1-element vector instead of a scalar):

julia> m = [1, 2, 3];
julia> v = [4, 5, 6];
julia> fun(v, m)
1-element Array{Int64,1}:
 32

Matrix of integers:

julia> a = [4 7;
            5 8;
            6 9];
julia> fun(a, m)
2-element Array{Int64,1}:
 32
 50

Vector of floating-point numbers:

julia> m = Float64.(m);
julia> v = Float64.(v);
julia> fun(v, m)
1-element Array{Float64,1}:
 32.0

All these should be type-stable. For example, in the last case where m and v are Float64 arrays:

julia> @code_warntype fun(v,m)
Body::Array{Float64,1}
2 1 ── %1   = (Base.arraysize)(a, 1)::Int64                                                          │╻╷╷╷              getindex
  │    %2   = (Base.slt_int)(%1, 0)::Bool                                                            ││╻╷╷╷              to_indices
  │           (Base.ifelse)(%2, 0, %1)                                                               │││┃││││             axes
  │           (Base.ifelse)(false, 0, 1)                                                             ││││╻╷╷╷╷             to_indices
  │    %5   = %new(Base.OneTo{Int64}, 1)::Base.OneTo{Int64}                                          │││││┃││               uncolon
  │    %6   = %new(Base.Slice{Base.OneTo{Int64}}, %5)::Base.Slice{Base.OneTo{Int64}}                 ││││││╻                 Type
  └───        goto #6 if not true                                                                    ││╻                 _getindex
  2 ── %8   = (Core.tuple)(1, %6)::Tuple{Int64,Base.Slice{Base.OneTo{Int64}}}                        │││               
  │    %9   = (Base.arraysize)(a, 1)::Int64                                                          │││╻╷╷               checkbounds
  │    %10  = (Base.slt_int)(%9, 0)::Bool                                                            ││││╻╷╷╷              checkbounds
  │    %11  = (Base.ifelse)(%10, 0, %9)::Int64                                                       │││││┃││││             axes
  │    %12  = (Base.sle_int)(1, 1)::Bool                                                             ││││││╻╷                checkindex
  │    %13  = (Base.sle_int)(1, %11)::Bool                                                           │││││││┃                 <=
  │    %14  = (Base.and_int)(%12, %13)::Bool                                                         │││││││╻                 &
  │           (Base.ifelse)(false, 0, 1)                                                             │││││││╻╷╷               Type
  │    %16  = (Base.and_int)(%14, true)::Bool                                                        ││││││╻                 &
  └───        goto #4 if not %16                                                                     ││││              
...

One of the key points here is to initialize c in the correct way, so that it has the adequate type (both in terms of array dimensionality and element type).
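As a small illustration of how that initialization adapts to the element type (just a sketch, independent of the function above):

a_float   = rand(3, 2)
a_complex = rand(ComplexF64, 3, 2)

zero(a_float[1, :])    # 2-element vector of Float64 zeros
zero(a_complex[1, :])  # 2-element vector of Complex{Float64} zeros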

Then, I expect similar(a[1,:]) to work just fine as well. Hence, it should also be good for Interval. Great!

To be honest, I would not have expected this to be type stable. I must have a pretty bad intuition about how all the machinery works :frowning:. It seemed to me that the compiler could not predict the type, since it depends on the input; that is why, by splitting all my functions, I was making sure the type would be known.

And why use a single function when you can use the same function name for different (specialized) input shapes? This is referred to as “Break functions into multiple definitions” in the Performance Tips section of the manual.

function sum_generic(a::Vector{T}, M::Vector{T}) where {T}
    c = zero(T)
    for i = 1:length(M)
        c += a[i]*M[i]
    end
    return c
end

function sum_generic(a::Matrix{T}, M::Vector{T}) where {T}
    c = zeros(T, size(a, 2))
    for j = 1:size(a, 2)
        for i = 1:length(M)
            c[j] += a[i,j]*M[i]
        end 
    end
    return c
end

using BenchmarkTools
a = rand(1000)
M = rand(1000)
@btime sum_generic($a, $M)

a = rand(1000,1000)
@btime sum_generic($a, $M)

Thank you for that reminder!

However, from @ffevotte's answer, it seems that I gain nothing (in terms of performance) by writing it this way anyway. Nevertheless, as I said, I am clearly misunderstanding how the compiler predicts the type of the output without knowing the input in advance, just by telling it “same type as the input”. I’ll keep reading!

These are the timings on my machine for my sum_generic and @ffevotte's fun, for both the Vector and the Matrix case:

sum_generic  90.068 ns (0 allocations: 0 bytes)       # Vector
fun          26.428 μs (1002 allocations: 93.94 KiB)  # Vector
sum_generic  1.253 ms (1 allocation: 7.94 KiB)        # Matrix
fun          3.571 ms (1002 allocations: 7.77 MiB)    # Matrix

I tested with BenchmarkTools on my actual functions and indeed it is faster. I am wondering whether there is a specific reason why you chose a double loop over the @. macro?

EDIT: after some tests, it is indeed faster to use the loop rather than @.. Interesting; I expected at most a minor difference, but no. Thank you for this too!

Just to be clear, the differences in performance that you are measuring here have nothing to do with breaking the function into multiple methods nor with type stability. They are mostly due to the use of a different algorithm, which accesses memory in the correct order (which is of course a very good thing to do).
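For reference, here is a minimal sketch of what “correct order” means (colmajor_sum and rowmajor_sum are just illustrative names): Julia arrays are stored column-major, so the innermost loop should run over the first (row) index.

function colmajor_sum(A)  # inner loop over the first index: contiguous memory access
    s = zero(eltype(A))
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

function rowmajor_sum(A)  # inner loop over the second index: strided, cache-unfriendly access
    s = zero(eltype(A))
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end

A = rand(1000, 1000)
colmajor_sum(A)  # noticeably faster than rowmajor_sum(A) for large A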

Comparing apples to apples, one gets very similar results:

function fun(a, m)
    c = zero(@view a[1,:])
    for j in 1:length(c)
        for i in 1:length(m)
            c[j] += a[i, j]*m[i]
        end
    end
    return c
end

function sum_generic(a::Vector{T}, M::Vector{T}) where {T}
    c = zero(T)
    for i in 1:length(M)
        c += a[i]*M[i]
    end
    return c
end
using BenchmarkTools

julia> @btime fun($v, $m)
  1.489 μs (3 allocations: 192 bytes)
1-element Array{Float64,1}:
 240.23659663578363

julia> @btime sum_generic($v, $m)
  1.422 μs (0 allocations: 0 bytes)
240.23659663578363

Now, many things could probably be said about all these implementations. But the main takeaway ideas, in my opinion, would be:

  • type stability is an important (and subtle) matter
  • break your function into multiple methods if you want to implement different algorithms for the different types
  • the Julia compiler will optimize each method for the actual types that it gets, so that you need not duplicate the same algorithm for different types in order to get performance.
  • while focusing on design choices, you do not want to forget that, as in any language, algorithms and memory usage patterns are very important.

One thing I’d question in some of the implementations above is the duplication of logic (the vector dot product). If these two cases (vector and matrix) are supposed to use the same underlying logic, the safest way to achieve that is to use the same code (commonly known as DRY code).

For example, let’s say that you realize that the repeated summing leads to numerical stability issues, and you want to change to pairwise or Kahan summation. It’s much better then (less risk of bugs, less coding) to just have to change a single implementation, than multiple ones.

Perhaps it doesn’t always make sense to do this in practice for performance reasons, but I think it’s worth trying hard (even sacrificing some performance), and in this case I think it can be done quite well:

@inline function sum_generic(a::AbstractVector{T}, M::AbstractVector{T}) where {T}
    c = zero(T)
    @inbounds @simd for i = 1:length(M)
        c += a[i]*M[i]
    end
    return c
end

function sum_generic(a::AbstractMatrix{T}, M::AbstractVector{T}) where {T}
    map(j -> sum_generic(view(a, :, j), M), 1:size(a,2))
end

By adding @inbounds and @simd, the code can also be sped up considerably. Sample timings:

julia> M = rand(1000); a = rand(1000);

julia> @btime sum_generic($a, $M);
  84.707 ns (0 allocations: 0 bytes)    # without simd: 1.138 μs

julia> M = rand(1000); a = rand(1000,1000);

julia> @btime sum_generic($a, $M);
  230.650 μs (3 allocations: 8.00 KiB)  # without simd: 1.130 ms
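To push the DRY point a bit further: if the repeated summation ever had to switch to, say, Kahan (compensated) summation, only the vector method would need to change, and the matrix method would pick it up for free. A rough, unbenchmarked sketch of such a swap:

function sum_generic(a::AbstractVector{T}, M::AbstractVector{T}) where {T}
    c   = zero(T)  # running sum
    err = zero(T)  # running compensation for lost low-order bits
    for i in 1:length(M)
        y = a[i]*M[i] - err
        t = c + y
        err = (t - c) - y
        c = t
    end
    return c
end

(No @simd here, since reordering the operations would defeat the compensation.)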

Of course, if this were the real use case in my code, I’d rather use dot(M, a) for the Vector case and the insanely optimized transpose(M) * a for the Matrix case.

sum_generic    89.780 ns (0 allocations: 0 bytes)  # Vector 
Julia builtin  98.874 ns (0 allocations: 0 bytes)  # Vector    
sum_generic    1.260 ms (1 allocation: 7.94 KiB)   # Matrix
Julia builtin  72.872 μs (2 allocations: 7.95 KiB) # Matrix

Yes, of course, that wasn’t my point. The dot product was just the example given in the OP. The point is to not duplicate the underlying logic. I think @ffevotte’s solution is great in that regard, and should be the preferred way to approach the problem. But due to the double for loop, it’s hard to optimize. In this specific example, it meant that I wasn’t able to get it to vectorize (there might be a way, but I couldn’t find it). Splitting it into two methods, with one being a thin wrapper forwarding to the other, still has the advantage of not duplicating logic, while being 5 to 14 times faster.

Btw, I find it odd that your original sum_generic ran in 90 ns. I have to add both @inbounds and @simd to get to that performance (otherwise I’m at ~1.1 μs). Do you get that speed without those annotations? May I ask what CPU and version of Julia you’re on?


I think the benefit of your two-tier implementation above is that you naturally write it in such a way that the accumulator is a dedicated, temporary variable, which allows the compiler to make sure that there are no aliasing problems.

In the nested-loops implementation, it is natural (but inefficient) to accumulate directly into c[j]. But this can easily be fixed to get the full benefit of @inbounds @simd:

function fun(a, m)
    c = zero(@view a[1,:])
    for j in 1:length(c)
        t = c[j]
        @inbounds @simd for i in 1:length(m)
            t += a[i, j]*m[i]
        end
        c[j] = t
    end
    return c
end

julia> m = rand(1000);
julia> a = rand(1000,1000);

julia> @btime fun($a, $m);
  400.348 μs (1 allocation: 7.94 KiB)

julia> @btime sum_generic($a, $m);
  422.276 μs (1003 allocations: 54.88 KiB)

PS: I think I copied your sum_generic implementation verbatim, but it seems to allocate on my system and not yours…

Ah, beautiful! It is still a bit slower for the vector case though (and returns a vector, not a scalar). Perhaps it’s the overhead from view (seems to be a fixed ~70 ns overhead):

julia> a = rand(100); M = rand(100);

julia> @btime sum_generic($a, $M)
  16.199 ns (0 allocations: 0 bytes)
26.837956881728676

julia> @btime fun($a, $M)
  85.927 ns (3 allocations: 192 bytes)
1-element Array{Float64,1}:
 26.837956881728676

julia> a = rand(1000); M = rand(1000);

julia> @btime sum_generic($a, $M)
  86.551 ns (0 allocations: 0 bytes)
253.61878159355567

julia> @btime fun($a, $M)
  154.657 ns (3 allocations: 192 bytes)
1-element Array{Float64,1}:
 253.61878159355567

Hmm, I’d guess that it inlined on my system and not yours? You could try with @inline in front of the first function sum_generic.

I have to admit I don’t see how to return a scalar without resorting to specializing at least one part of the function. Your implementation is much better in this respect.


That’s what I would think, too.


Yes, inlining the first method makes both implementations perform similarly, both in terms of run time and number of allocations.

You absolutely should not define different functions with different names if they are all intended to do ‘the same thing’, just for different types. That completely throws the whole concept of ‘multiple dispatch’ out the window, which means discarding the central paradigm of Julia.

All in all, why not simply do

fun(a, M) = a' * M

?
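A quick sketch of how that single definition covers both cases (fun_noconj is just an illustrative name):

fun(a, M) = a' * M

m = rand(3); v = rand(3); A = rand(3, 2)

fun(v, m)  # scalar, like fun_vec
fun(A, m)  # 2-element vector, like fun_matrix

# Caveat: a' conjugates complex entries; if that is not wanted, use transpose:
fun_noconj(a, M) = transpose(a) * M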

That’s a totally reasonable thing to wonder about. It’s certainly true that predicting the output type without knowing the input types would be impossible. Fortunately, that’s not how Julia works!

Instead, what happens is that when you call your function with some input values, Julia compiles the correct, specialized version for the actual types of those values. At this point, the input types are known (because they’re just the types the function was actually called with), so the output type may be correctly inferred (assuming your function has an inferrable return type). That means that when you write:

function foo(x)
  x + 1
end

then the first time you call foo() with an Int argument, Julia will compile a version in which x::Int. And when you call foo() with a Matrix{Float64} argument, Julia will compile a specialized version in which x::Matrix{Float64}. You never need to explicitly write the input types of a function unless (a) you want to change which method is called for a given type or (b) you want to document the input type for future readers or users of the code.
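A tiny illustration (just a sketch):

function foo(x)
  x + 1
end

foo(1)    # first call with an Int compiles a specialization for x::Int64
foo(2.5)  # first call with a Float64 compiles a second specialization

# methods(foo) still lists a single generic method; the type-specialized
# compiled code lives underneath it.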

Concretely, this means that there is absolutely no performance difference between writing:

function foo(x::Int)
  ...
end

function foo(x::Float64)
  ...
end

instead of

function foo_int(x::Int64)
  ...
end

function foo_float(x::Float64)
  ...
end

So we strongly recommend you do the former (just writing two different methods for foo()) as long as those methods are logically doing the same thing.

Furthermore, there’s also no performance penalty to writing:

function foo(x)
  ...
end

i.e. leaving the type of x unspecified in the function signature. Julia will still compile the correct, specialized version for any x values you pass.

I think that by focusing so much on type stability and performance instead of understanding the basics of multiple dispatch, you are putting the cart before the horse.

First, get familiar and comfortable with using multiple dispatch and try to write simple and generic code with just the type annotations you need. Then you can start worrying about type stability.


@DNF Indeed, I might have misunderstood multiple dispatch.
Perhaps it was obvious that using functions such as eltype when initializing the variable would not hurt performance (thank you @rdeits for your post).

Yes, I enable the optimization switches --inline=yes and --check-bounds=no when starting Julia; I should have made that explicit, though.

Also, I see that writing multiple definitions instead of a single function is the way to go if one cares about performance. This enables Julia’s multiple dispatch system to compile optimized versions of the same function for different argument types.

That’s totally right: the Matrix case can be coded so that it uses the Vector case within a single loop, but again, the compiler is still not able to remove the extra allocations, and the result is a small performance degradation. Here is a comparison:

# wrapper around the Vector case
function sum_generic(a::Matrix{T}, M::Vector{T}) where {T}
    c = zeros(T, size(a, 2))
    for j = 1:size(a, 2)
        c[j] += sum_generic(a[:,j], M)
    end
    return c
end

which times:

  1.257 ms (1 allocation: 7.94 KiB)     # double-loop
  1.283 ms (1001 allocations: 7.76 MiB) # wrapper around Vector

That’s because you’re copying each column by doing a[:,j]. An easy way to avoid those allocations is to use a view. Did you see my post above? There’s a self-contained example and benchmark there using a view.
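Concretely, the wrapper above would only need a one-line change (untested sketch; it assumes the vector method is written for AbstractVector, as in my post above, so that it accepts a view):

function sum_generic(a::Matrix{T}, M::Vector{T}) where {T}
    c = zeros(T, size(a, 2))
    for j = 1:size(a, 2)
        c[j] += sum_generic(view(a, :, j), M)  # view instead of a[:,j]: no per-column copy
    end
    return c
end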

Even if there were a small performance degradation (2% in your benchmark), that’s a small price to pay to avoid duplicating logic, which in my experience inevitably leads to higher maintenance costs in the future, and frequently to bugs/inconsistencies between the different implementations.
