Understanding the performance of trait-based dispatch together with type annotations

I’m trying to write performant code, and I have generally opted to add type annotations to help the compiler, since writing generic code is usually not important for my use case.

I’m currently trying to implement trait-based dispatch (as discussed here) in some of my functions, but I’ve been confused by its interaction with type annotations.

I’ve made the following minimal example:

using LinearAlgebra

abstract type FunctionType end

struct F1 <: FunctionType end
struct F2 <: FunctionType end

struct Vecs
    x::Vector{Float64}
    y::Vector{Float64}
end

const vecs = Vecs([1,2],[2,3])


function combine(x, y, ::Type{F1})
    return norm(x) + norm(y)
end

function combine(x, y, ::Type{F2})
    return norm(x)^2 + norm(y)^2
end

combine(v::Vecs, t::Type{FT}) where FT <: FunctionType = combine(v.x, v.y, t)

function test1(v, t)
    ti = time()
    for _ in 1:10^8
        combine(v, t)
    end
    tf = time()
    tf-ti
end
test1(vecs, F1) # returns 10.300249099731445

function test2(v, t::Type{FT}) where FT <: FunctionType
    ti = time()
    for _ in 1:10^8
        combine(v, t)
    end
    tf = time()
    tf-ti
end
test2(vecs, F1) # returns 1.391638994216919

Here I assumed the test function would act as a function barrier, allowing the compiler to specialize the code within the loop so that it’s already known which combine method to call. However, it seems that this is only the case when I also type-annotate t. I don’t understand why this matters, because the type annotation is still not a concrete type, so I thought it shouldn’t really change anything about how the function is compiled.
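As a rough sanity check (the exact numbers will vary from machine to machine), allocation counts from @time seem like a reasonable proxy for runtime dispatch here, since a dynamically dispatched combine would have to box its result on every iteration:

@time test1(vecs, F1)  # reports many allocations when combine is dispatched at runtime
@time test2(vecs, F1)  # essentially no allocations once combine is resolved statically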

However, if I now drop the type annotation on the second argument of combine:

combine(v::Vecs, t) = combine(v.x, v.y, t)
test1(vecs, F1) # returns 1.5857880115509033
test2(vecs, F1) # returns 1.5145998001098633

Of course the times fluctuate a little, but they are consistent: test2 is slightly slower than before, while test1 is now dramatically faster than it was, though still a bit slower than test2.

Then, if I remove all type annotations from combine:

combine(v, t) = combine(v.x, v.y, t)
test1(vecs, F1) # returns 1.5554089546203613
test2(vecs, F1) # returns 1.3924591541290283

So for test1 basically nothing seems to have changed, but test2 is again as fast as it was with full annotations. This confuses me, because I don’t understand why removing the type annotation on v interacts with test2, where t is annotated.

Can anyone help interpret these results?

Also, I didn’t use @benchmark, since it changes the scoping of the variables, which made it hard to produce the results I wanted to show (though they do correspond to the real execution-time differences I’m seeing).
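For reference, BenchmarkTools does let you interpolate globals into the benchmarked expression with $, which addresses part of the scoping issue (a sketch, assuming BenchmarkTools is installed):

using BenchmarkTools
@benchmark combine($vecs, F1)  # $vecs is spliced in as a local; F1 is already a constant type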

Edit: This is all in Julia 1.9

Be aware of when Julia avoids specializing. Functions do not specialize on a Type, Function, or Vararg argument if it’s only passed as an argument to other functions; you can make them specialize by introducing method type parameters. Your combine does that, but because test1 didn’t specialize on t, it still needed to check t and dispatch combine(v, t) at runtime. Using @time shows that test1 in the first example does millions of allocations, likely to accommodate an unknown return type from runtime dispatch.
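To spell out the rule with a minimal sketch (passthrough and passthrough_sp are hypothetical names mirroring your test1/test2):

# t is a Type that is only passed through to another call, so Julia's
# heuristic skips specializing on it and dispatches combine at runtime:
passthrough(v, t) = combine(v, t)

# A method type parameter forces specialization, so the right combine
# method is already known when this compiles:
passthrough_sp(v, ::Type{T}) where {T<:FunctionType} = combine(v, T)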

In the versions omitting t::Type{FT} in combine, test1 has the same performance as test2**. However, if you contrive to add more FunctionTypes and combine methods up to F4, the test1s will also allocate millions of times and degrade to performance similar to the first example. When a function has fewer applicable methods than a max_methods cutoff, the compiler tries to elide runtime dispatch by working out the one or few possible return types, but it stopped doing this with t::Type{FT} present. That may be because a method with a type parameter covers an unbounded number of possible specializations, though I’ve been able to rescue this optimization with a return-type annotation in another case (not yours). I guess this shows that cutoff-based optimizations are nice but can be brittle, especially when they aren’t documented plainly.
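Concretely, the contrived extension looks something like this (a sketch: F3 and F4 are made up here, and the default max_methods cutoff is an implementation detail, 3 on the versions I’ve checked):

struct F3 <: FunctionType end
struct F4 <: FunctionType end

combine(x, y, ::Type{F3}) = norm(x) - norm(y)
combine(x, y, ::Type{F4}) = norm(x)^2 - norm(y)^2

# With this many applicable combine methods, the compiler stops enumerating the
# possible targets and return types, and the unspecialized test1 falls back to
# allocating runtime dispatch.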

**I wouldn’t read anything into the smaller timing differences. Timing variance from background processes can easily add up to <30% differences over this many iterations; @benchmark displays a distribution of timings for this reason.
