# Improving performance elegantly - type stability

I would say that the first question is “Can you make it (at all) with one function?”. I.e is there a way to have a generic code which, when called with arguments of different types, always computes what you want it to compute. In your case, since you broadcast some operations in the “matrix” case but not the “vector” case, none of your implementations work as intended for the other type (or maybe I did not understand what you intended to compute).

When you have a function, you can try to see whether it is type stable.

I’m not exactly sure what you’re trying to achieve here, but perhaps this is not too far from what you want:

``````function fun(a, m)
c = zero(a[1,:])
for i in 1:length(m)
@. c += a[i, :]*m[i]
end
return c
end
``````

Vector of Integers (note that it differs from your `fun_vec` implementation in that it returns a 1-element vector instead of a scalar)

``````julia> m = [1, 2, 3];
julia> v = [4, 5, 6];
julia> fun(v, m)
1-element Array{Int64,1}:
32
``````

Matrix of integers:

``````julia> a = [4 7;
5 8;
6 9];
julia> fun(a, m)
2-element Array{Int64,1}:
32
50
``````

Vector of Floating-Points:

``````julia> m = Float64.(m);
julia> v = Float64.(v);
julia> fun(v, m)
1-element Array{Float64,1}:
32.0
``````

All these should be type-stable. For example in the last case where `m` and `v` are `Float64` arrays:

``````julia> @code_warntype fun(v,m)
Body::Array{Float64,1}
2 1 ── %1   = (Base.arraysize)(a, 1)::Int64                                                          │╻╷╷╷              getindex
│    %2   = (Base.slt_int)(%1, 0)::Bool                                                            ││╻╷╷╷              to_indices
│           (Base.ifelse)(%2, 0, %1)                                                               │││┃││││             axes
│           (Base.ifelse)(false, 0, 1)                                                             ││││╻╷╷╷╷             to_indices
│    %5   = %new(Base.OneTo{Int64}, 1)::Base.OneTo{Int64}                                          │││││┃││               uncolon
│    %6   = %new(Base.Slice{Base.OneTo{Int64}}, %5)::Base.Slice{Base.OneTo{Int64}}                 ││││││╻                 Type
└───        goto #6 if not true                                                                    ││╻                 _getindex
2 ── %8   = (Core.tuple)(1, %6)::Tuple{Int64,Base.Slice{Base.OneTo{Int64}}}                        │││
│    %9   = (Base.arraysize)(a, 1)::Int64                                                          │││╻╷╷               checkbounds
│    %10  = (Base.slt_int)(%9, 0)::Bool                                                            ││││╻╷╷╷              checkbounds
│    %11  = (Base.ifelse)(%10, 0, %9)::Int64                                                       │││││┃││││             axes
│    %12  = (Base.sle_int)(1, 1)::Bool                                                             ││││││╻╷                checkindex
│    %13  = (Base.sle_int)(1, %11)::Bool                                                           │││││││┃                 <=
│    %14  = (Base.and_int)(%12, %13)::Bool                                                         │││││││╻                 &
│           (Base.ifelse)(false, 0, 1)                                                             │││││││╻╷╷               Type
│    %16  = (Base.and_int)(%14, true)::Bool                                                        ││││││╻                 &
└───        goto #4 if not %16                                                                     ││││
...
``````

One of the key points here is to initialize `c` in the correct way, so that it has the adequate type (both in terms of array dimensionality and element type).

Then, I expect `similar(a[1,:])` to work just fine as well. Hence, it should be also good for Interval. Great!

To be honest, I would not have expected this to be type stable. I must have a pretty bad intuition on how all the machinery works . It seemed to me that the compiler could not predict the type since it depends on the input, that is why by splitting all my function I was making sure the type would be known.

And why use a single function when you can efficiently use the same function name for different (specialized) input shapes? This is referred to as Break functions into multiple definitions in the Performance Tips in the manual.

``````function sum_generic(a::Vector{T}, M::Vector{T}) where {T}
c = zero(T)
for i = 1:length(M)
c += a[i]*M[i]
end
return c
end

function sum_generic(a::Matrix{T}, M::Vector{T}) where {T}
c = zeros(T, size(a, 2))
for j = 1:size(a, 2)
for i = 1:length(M)
c[j] += a[i,j]*M[i]
end
end
return c
end

using BenchmarkTools
a = rand(1000)
M = rand(1000)
@btime sum_generic(\$a, \$M)

a = rand(1000,1000)
@btime sum_generic(\$a, \$M)
``````
3 Likes

Thank you for that reminder!

However, from @ffevotte answer, it seems that I gain nothing - in terms of performance - by writing it this way anyways. Nevertheless, as I said, I am clearly miss understanding how the compiler predicts the type of the output not knowing the input in advance and yet simply telling him “same type as the input”. I’ll keep reading!

These are the timings on my machine for my `sum_generic ` and `fun` by @ffevotte for both cases of Vector or Matrix:

``````sum_generic  90.068 ns (0 allocations: 0 bytes)       # Vector
fun          26.428 μs (1002 allocations: 93.94 KiB)  # Vector
sum_generic  1.253 ms (1 allocation: 7.94 KiB)        # Matrix
fun          3.571 ms (1002 allocations: 7.77 MiB)    # Matrix
``````

I tested with BenchmarkTools on my actual functions and indeed it is faster. I am wondering if there is a specific reason why you chose a double loop over the macro `@.` ?

EDIT: after some tests, it is faster to use the loop vs `@.`. Interesting I expected at most a minor difference, but no. Thank you for this too!

Just to be clear, the differences in performance that you are measuring here have nothing to do with breaking the function into multiple methods nor with type stability. They are mostly due to the use of a different algorithm, which accesses memory in the correct order (which is of course a very good thing to do).

Comparing the apples to apples, one gets very similar results:

``````function fun(a, m)
c = zero(@view a[1,:])
for j in 1:length(c)
for i in 1:length(m)
c[j] += a[i, j]*m[i]
end
end
return c
end

function sum_generic(a::Vector{T}, M::Vector{T}) where {T}
c = zero(T)
for i in 1:length(M)
c += a[i]*M[i]
end
return c
end
``````
``````using BenchmarkTools

julia> @btime fun(\$v, \$m)
1.489 μs (3 allocations: 192 bytes)
1-element Array{Float64,1}:
240.23659663578363

julia> @btime sum_generic(\$v, \$m)
1.422 μs (0 allocations: 0 bytes)
240.23659663578363
``````

Now many things could probably be said about all these implementations. But the main takeaway ideas in my opinion would be:

• type stability is an important (and subtle) matter
• break your function into multiple methods if you want to implement different algorithms for the different types
• the Julia compiler will optimize each method for the actual types that it gets, so that you need not duplicate the same algorithm for different types in order to get performance.
• while focusing on design choices, you do not want to forget that, as in any language, algorithms and memory usage patterns are very important.
4 Likes

One thing I’d question in some of the implementations above, is the duplication of logic (vector dot product). If these two cases (vector and matrix) are supposed to use the same underlying logic, the safest way to achieve that is to use the same code (commonly known as DRY code).

For example, let’s say that you realize that the repeated summing leads to numerical stability issues, and you want to change to pairwise or Kahan summation. It’s much better then (less risk of bugs, less coding) to just have to change a single implementation, than multiple ones.

Perhaps it doesn’t always make sense to do this in practice due to performance reasons, but I think it’s worth trying hard (even sacrifice some performance), and in this case I think it can be done quite well:

``````@inline function sum_generic(a::AbstractVector{T}, M::AbstractVector{T}) where {T}
c = zero(T)
@inbounds @simd for i = 1:length(M)
c += a[i]*M[i]
end
return c
end

function sum_generic(a::AbstractMatrix{T}, M::AbstractVector{T}) where {T}
map(j -> sum_generic(view(a, :, j), M), 1:size(a,2))
end
``````

By adding `@inbounds` and `@simd`, the code can also be sped up considerably. Sample timings:

``````julia> M = rand(1000); a = rand(1000);

julia> @btime sum_generic(\$a, \$M);
84.707 ns (0 allocations: 0 bytes)    # without simd: 1.138 μs

julia> M = rand(1000); a = rand(1000,1000);

julia> @btime sum_generic(\$a, \$M);
230.650 μs (3 allocations: 8.00 KiB)  # without simd: 1.130 ms
``````
2 Likes

Of course if this is the real use case in my code, I’d rather use `dot(M,a)` for the Vector case and the insanely optimized `transpose(M) * a` for the Matrix case.

``````sum_generic    89.780 ns (0 allocations: 0 bytes)  # Vector
Julia builtin  98.874 ns (0 allocations: 0 bytes)  # Vector
sum_generic    1.260 ms (1 allocation: 7.94 KiB)   # Matrix
Julia builtin  72.872 μs (2 allocations: 7.95 KiB) # Matrix
``````

Yes of course, that wasn’t my point. Dot product was just the example given in the OP. The point is to not duplicate the underlying logic. I think `ffevotte`s solution is great in that regard, and should be the preferred way to approach the problem. But due to the double for loop, it’s hard to optimize. In this specific example, it meant that I wasn’t able to get it to vectorize (there might be a way, but I couldn’t find it). Splitting it into two methods then, with one being a thin wrapper forwarding to the other, still has the advantage of not duplicating logic, while being 5 - 14 times faster.

Btw, I find it odd that your original `sum_generic` ran in 90 ns. I have to add both `@inbounds` and `@simd` to get at that performance (otherwise I’m at ~1.1 μs). Do you get that speed without those annotations? May I ask what CPU and version of Julia you’re on?

1 Like

I think the benefit of your two-tier implementation above is that you naturally write it in such a way that the accumulator is a dedicated, temporary variable, which allows the compiler to make sure that there are no aliasing problems.

In the nested loops implementation, it is natural (but inefficient) to directly accumulate in `c[j]`. But this can easily be fixed to get the full benefit of `@inbounds @simd`:

``````function fun(a, m)
c = zero(@view a[1,:])
for j in 1:length(c)
t = c[j]
@inbounds @simd for i in 1:length(m)
t += a[i, j]*m[i]
end
c[j] = t
end
return c
end
``````
``````julia> m = rand(1000);
julia> a = rand(1000,1000);

julia> @btime fun(\$a, \$m);
400.348 μs (1 allocation: 7.94 KiB)

julia> @btime sum_generic(\$a, \$m);
422.276 μs (1003 allocations: 54.88 KiB)
``````

PS: I think I copied your `sum_generic` implementation verbatim, but it seems to allocate on my system and not yours…

Ah, beautiful! It is still a bit slower for the vector case though (and returns a vector, not a scalar). Perhaps it’s the overhead from `view` (seems to be a fixed ~70 ns overhead):

``````julia> a = rand(100); M = rand(100);

julia> @btime sum_generic(\$a, \$M)
16.199 ns (0 allocations: 0 bytes)
26.837956881728676

julia> @btime fun(\$a, \$M)
85.927 ns (3 allocations: 192 bytes)
1-element Array{Float64,1}:
26.837956881728676

julia> a = rand(1000); M = rand(1000);

julia> @btime sum_generic(\$a, \$M)
86.551 ns (0 allocations: 0 bytes)
253.61878159355567

julia> @btime fun(\$a, \$M)
154.657 ns (3 allocations: 192 bytes)
1-element Array{Float64,1}:
253.61878159355567
``````

Hmm, I’d guess that it inlined on my system and not yours? You could try with `@inline` in front of the first `function sum_generic`.

I have to admit I don’t see how to return a scalar without resorting to specializing at least one part of the function. Your implementation is much better in this respect.

That’s what I would think, too.

Yes, inlining the first method makes both implementations perform similarly, both in terms of run time and number of allocations.

You absolutely should not define different functions with different names, if they are all intended to do ‘the same thing’, just for different types. That completely throws the whole concept of ‘multiple dispatch’ out the window, which means discarding the central paradigm of Julia.

All in all, why not simply do

``````fun(a, M) = a' * M
``````

?

That’s a totally reasonable thing to wonder about. It’s certainly true that predicting the output type without knowing the input types would be impossible. Fortunately, that’s not how Julia works!

Instead, what happens is that when you call your function with some input values, Julia compiles the correct, specialized version for the actual types of those values. At this point, the input types are known (because they’re just the types the function was actually called with), so the output type may be correctly inferred (assuming your function has an inferrable return type). That means that when you write:

``````function foo(x)
x + 1
end
``````

then the first time you call `foo()` with an `Int` argument, Julia will compile a version in which `x::Int`. And when you call `foo()` with a `Matrix{Float64}` argument, Julia will compile a specialized version in which `x::Matrix{Float64}`. You never need to explicitly write the input types of a function unless (a) you want to change which method is called for a given type or (b) you want to document the input type for future readers or users of the code.

Concretely, this means that there is absolutely no performance difference between writing:

``````function foo(x::Int)
...
end

function foo(x::Float64)
...
end
``````

``````function foo_int(x::Int64)
...
end

function foo_float(x::Float64)
...
end
``````

So we strongly recommend you do the former (just writing two different methods for `foo()`) as long as those methods are logically doing the same thing.

Futhermore, there’s also no performance penalty to writing:

``````function foo(x)
...
end
``````

i.e. leaving the type of `x` unspecified in the function signature. Julia will still compile the correct, specialized version for any `x` values you pass.

I think that by focusing so much on type stability and performance instead of understanding the basics of multiple dispatch, you are putting the cart before the horse.

First, get familiar and comfortable with using multiple dispatch and try to write simple and generic code with just the type annotations you need. Then you can start worrying about type stability.

1 Like

@DNF Indeed, I might have misunderstood multiple dispatch.
Perhaps it was obvious that functions such as `eltype` when initializing the variable would not hurt the performance (thank you @rdeits for your post).

Yes, I enable optimization switches `--inline=yes` and `--check-bounds=no` during compilation, I should’ve made it explicit, though.

Also, I see that writing multiple definitions instead of a single function is the way to go if one cares about performance. This enables Julia’s multiple dispatch system to compile optimized versions of the same function for different argument types.

That’s totally right, the Matrix case can be coded such that it uses the Vector case within a single loop, but again, the compiler is still not able to remove the extra allocations and the result is a small performance degradation. Here is a comparison:

``````# wrapper around the Vector case
function sum_generic(a::Matrix{T}, M::Vector{T}) where {T}
c = zeros(T, size(a, 2))
for j = 1:size(a, 2)
c[j] += sum_generic(a[:,j], M)
end
return c
end
``````

which times:

``````  1.257 ms (1 allocation: 7.94 KiB)     # double-loop
1.283 ms (1001 allocations: 7.76 MiB) # wrapper around Vector
``````

That’s because you’re copying each column by doing `a[:,j]`. An easy way to avoid those allocations is to use a `view`. Did you see my post above? There’s a self-contained example and benchmark there using a view.

Even if there was a small performance degradation (2 % in your benchmark), that’s a small price to pay to avoid duplicating logic, which in my experience inevitably leads to higher maintenance costs in the future, and frequently bugs/inconsistencies between the different implementations.

2 Likes

I agree. I’m curious though, why `map` was able to vectorize correctly while loops weren’t?