Impact of specifying input types on function performance, when referencing arguments from Vector{Any}

I have a performance issue I’d like to solve. I need to reference some function arguments from a Vector{Any} array. Most of the time this incurs an expensive allocation penalty. However, when I call an explicitly typed function, the allocation doesn’t happen and the code runs much faster. I’ve written an MWE to demonstrate the issue. My problem is that, unfortunately, I cannot specify the input types to the degree required to avoid the allocation, so I’m wondering why it happens in some cases, and whether it can be avoided.

Here’s the MWE, and the output it produces:

using BenchmarkTools

struct TypeA{T<:Real}
    a::T
end

struct TypeB{T<:Real}
    a::T
    b::T
end

function myfun1(a, b)
    return a.a * b.a + b.b
end

function myfun2(a::TypeA, b::TypeB)
    return a.a * b.a + b.b
end

function myfun3(a::TypeA{T}, b::TypeB{T}) where T <: Real
    return a.a * b.a + b.b
end

function myfun4(a::TypeA{Float64}, b::TypeB{Float64})
    return a.a * b.a + b.b
end
a = TypeA(1.)
b = TypeB(1.,2.)
v = [a, b]

@btime myfun1($a, $b)
@btime myfun1($v[1], $v[2])
@btime myfun2($v[1], $v[2])
@btime myfun3($v[1], $v[2])
@btime myfun4($v[1], $v[2])
# Added after post from mikmoore:
@btime myfun1($v[1]::TypeA, $v[2]::TypeB)
@btime myfun1($v[1]::TypeA{Float64}, $v[2]::TypeB{Float64})

  2.125 ns (0 allocations: 0 bytes)
  17.869 ns (1 allocation: 16 bytes)
  19.079 ns (1 allocation: 16 bytes)
  18.120 ns (1 allocation: 16 bytes)
  3.333 ns (0 allocations: 0 bytes)
  18.788 ns (1 allocation: 16 bytes)
  2.416 ns (0 allocations: 0 bytes)

What’s also strange is that in my actual use case, parameterizing similarly to myfun3 above is sufficient to avoid the allocations. Nevertheless, that is too constraining for my use case.

As long as you do this:

You’ll have problems:

Your best approach would be to make it a vector of a concrete type.

Unfortunately that’s not an option for me. What I don’t understand is: if the allocation can be avoided in some circumstances, why can’t it be avoided in all of the cases I coded? After all, the code should be doing the same thing in all cases.

Note also that the arguments made in the post you link to do not apply to my use case. I do not care about the performance of constructing or manipulating the vector. I only care about the performance of a function that uses some of the entries. We know, from the last case in my MWE, that the lookup time of those entries is not a problem. The problem is the allocation. And we also know that it’s possible for the compiler to avoid this allocation.

I’m also curious and watching here to understand why myfun4 shows better performance. But you’ll still pay the price for unwrapping the abstract type somewhere.

I didn’t know that Julia was smart enough to make myfun4 use static dispatch with uncertain inputs!

I’m fairly confident that the reason myfun4 works well here is that Julia knows that the inputs are either those precise concrete types or that it can throw an error. For all the others, there are an infinite number of possible input types and so dynamic dispatch is required.

I’m not sure this will work in your actual use case, but concrete Union-typed arrays perform very well using union splitting (automatic if-else logic over a limited number of known types). In this example you could construct this using v = Union{typeof(a),typeof(b)}[a, b] and should see most of the performance recovered. A tuple can be an inhomogeneous container and could also work here v = (a, b), but again may not work in your real application.
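As a sketch of that Union-typed suggestion (the TypeA/TypeB definitions from the MWE are repeated here so the snippet is self-contained; v_union is just an illustrative name):

```julia
struct TypeA{T<:Real}
    a::T
end

struct TypeB{T<:Real}
    a::T
    b::T
end

myfun1(a, b) = a.a * b.a + b.b

a = TypeA(1.0)
b = TypeB(1.0, 2.0)

# A concretely Union-typed vector: indexing yields a small Union of
# known concrete types, which the compiler can split into if-else
# branches instead of falling back to dynamic dispatch.
v_union = Union{typeof(a), typeof(b)}[a, b]

result = myfun1(v_union[1], v_union[2])  # 1.0 * 1.0 + 2.0 == 3.0
```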


Your answer seems to imply that dynamic dispatch implies allocation (at least in this case). Is that true? I don’t understand the reason for this. Can you share any references?

Yes. Dynamic dispatch always (almost always?) results in allocations. I don’t have a reference handy but perhaps someone else has one.
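A quick way to see this (a minimal sketch, not from the thread; dyn and add1 are made-up names): route the same call through a Vector{Any} and through a concretely typed vector, and compare allocations with @allocated after warming both paths up.

```julia
add1(x) = x + 1

# When v is a Vector{Any}, v[1] has unknown type at compile time, so
# the call to add1 goes through dynamic dispatch and its Float64
# result gets boxed (heap-allocated).
dyn(v) = add1(v[1])

v_any   = Any[1.0]
v_typed = [1.0]          # Vector{Float64}: fully inferable

dyn(v_any); dyn(v_typed)  # warm up (compile) both paths

dynamic_bytes = @allocated dyn(v_any)
dynamic_bytes = @allocated dyn(v_any)    # measure again, past any compilation
static_bytes  = @allocated dyn(v_typed)
static_bytes  = @allocated dyn(v_typed)

# dynamic_bytes is nonzero (the box); static_bytes is zero.
```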

Interestingly (or somewhat expectedly), if you add more “concrete” definitions of myfun4, you lose its advantage with respect to the other functions:

for T ∈ [Float64, Float32, Int32, Int64, BigFloat, BigInt]
    @eval function myfun4(a::TypeA{$T}, b::TypeB{$T})
        return a.a * b.a + b.b
    end
end
julia> @btime myfun3($v[1], $v[2])
  32.393 ns (1 allocation: 16 bytes)

julia> @btime myfun4($v[1], $v[2])
  33.348 ns (1 allocation: 16 bytes)

In practice, if your v vector is not too long, it might be worth it to use a tuple instead.
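For illustration, a sketch of the tuple variant (reusing the struct and function definitions from the MWE; this assumes a short, fixed-length collection, which the OP later rules out):

```julia
struct TypeA{T<:Real}
    a::T
end

struct TypeB{T<:Real}
    a::T
    b::T
end

myfun1(a, b) = a.a * b.a + b.b

# A tuple's element types are part of the tuple's own type, so t[1]
# and t[2] are fully inferable: no dynamic dispatch, no boxing.
t = (TypeA(1.0), TypeB(1.0, 2.0))

myfun1(t[1], t[2])  # 1.0 * 1.0 + 2.0 == 3.0
```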


Thanks for the suggestion of using a vector of unions. However, it comes with a significant time penalty in my actual use case. When I use strict function parameterization to avoid allocations, it makes the whole function call 2-4x slower than just using Vector{Any}.

Interesting. I’d really like to understand the reason for the allocation. I don’t intuitively see a need for it, even in the case of dynamic dispatch.

The vector length is variable and will potentially run into the millions of entries, so a tuple is not an option.

I could imagine the allocation is a result of needing a boxed output for the return value. But I think it would be there even without that need. The reasons for allocations are deeper in the Julia innards than most of us ever venture. I’m sure they would have been avoided if possible. But someone may eventually pop in here with a concrete answer. You could start looking through the “Developer Documentation” sections of the docs in the meantime.

Does your situation permit you to provide callsite type annotations to your function calls?

julia> @btime myfun4($v[1]::TypeA{Float64}, $v[2]::TypeB{Float64})
  3.700 ns (0 allocations: 0 bytes)

It’s not very common that one has a giant array with many types whose elements need to be dumped arbitrarily into functions (or, at least, not while dynamic dispatch overhead is also a concern). Usually you have some notion of what will go where. Perhaps there is a way to restructure your code to avoid this issue? It’s hard to know without more details.
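One standard way to restructure (a generic pattern, not something proposed verbatim in this thread; kernel and outer are illustrative names) is a function barrier: do the type-unstable lookup once, then hand the values to an inner function that gets compiled for their concrete runtime types, so any heavy work inside it runs at full speed.

```julia
# Inner "kernel": compiled specifically for the concrete types of a
# and b, so the loop inside it is type-stable and fast.
function kernel(a, b, n)
    s = zero(a)
    for _ in 1:n
        s += a * b
    end
    return s
end

# Outer function: type-unstable because v is a Vector{Real}, but it
# pays for ONE dynamic dispatch at the barrier instead of one per
# iteration.
outer(v, n) = kernel(v[1], v[2], n)

v = Real[1.0, 2.0]
outer(v, 1000)  # 1000 * (1.0 * 2.0) == 2000.0
```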


I didn’t know about callsite annotations. Thank you! I updated the question to reflect the impact of that. Unfortunately in my case it’s difficult because the types can either be some version of a float, or an autodiff value (e.g. when using ForwardDiff), and you only know at runtime what the combination of types will be.

I’ve been thinking about data storage, and it may be possible to have a general interface that still allows for problem-specific data storage. Thanks for the prod in that direction.

If you would like a MWE:

julia> function f(a,b)
           return a * b
       end
f (generic function with 1 method)

julia> v = Real[1.0, 1.0];

julia> @btime f($v[1], $v[2])
  12.491 ns (1 allocation: 16 bytes)

It is strange to me. Maybe that MWE is a benchmarking artifact? There is no inference issue there:

julia> @code_warntype f(v[1],v[2])
MethodInstance for f(::Float64, ::Float64)
  from f(a, b) in Main at REPL[5]:1
1 ─ %1 = (a * b)::Float64
└──      return %1

Or, alternatively, it is a benchmarking feature, because the true use case would maybe be something more similar to:

julia> g(v) = f(v[1],v[2])
g (generic function with 1 method)

julia> @btime g($v)
  12.740 ns (1 allocation: 16 bytes)

in which case it is clear that the output type cannot be deduced from the type of v, and must be boxed.


Thanks. That’s a nice, compact MWE.

I don’t know what a boxed output is. However, if I specify the output type of g(), it makes no difference:

using BenchmarkTools
f(a, b) = a * b
g(v)::Float64 = f(v[1],v[2])
v = Real[1., 1.]
@btime f($v[1], $v[2])
@btime g($v)

  45.538 ns (1 allocation: 16 bytes)
  119.119 ns (1 allocation: 16 bytes)

I don’t understand why this is so slow, even with the allocation. It’s much slower than my previous example.

Specifying the output type won’t make any difference, because those type annotations are equivalent to calls to convert. That is, you are doing the equivalent of:

julia> f(a,b) = a*b
f (generic function with 1 method)

julia> _g(v) = f(v[1],v[2])
_g (generic function with 1 method)

julia> g(v) = convert(Float64,_g(v))
g (generic function with 1 method)

That doesn’t change the fact that _g(v) is type-unstable. The output-type annotation is useful for localizing the instability at that call, if you are using the output of g afterwards: you are giving the compiler information about what it can do with the result of g.

“Boxing” (afaiu) is simply this: since the type of the output of g cannot be inferred at compile time (because the elements of v can be of any subtype of Real), the compiler needs to store the output of g as a pointer to a “box” (a position in memory) which is allocated at run time, depending on the output type. Thus, it is more or less a synonym for “allocating”, in the sense that a new place in memory must be found at run time to store the result.
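A small check of that reading (reusing the f/g/v definitions from the MWE above): inference can be queried directly with Base.return_types, and it shows that the inferred return type of g is not concrete, which is what forces the box.

```julia
f(a, b) = a * b
g(v) = f(v[1], v[2])
v = Real[1.0, 1.0]

# getindex on a Vector{Real} is inferred as Real (abstract), so the
# result of f, and hence of g, cannot be inferred as a concrete type.
rt = Base.return_types(g, Tuple{Vector{Real}})[1]

isconcretetype(rt)  # false: the result must be boxed at run time
```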

Concerning the benchmark, I think there is a tricky thing there, which is a property of how BenchmarkTools deals with the input variables:

julia> @btime f($v[1],$v[2])
  12.964 ns (1 allocation: 16 bytes)

julia> @btime f($(v[1]),$(v[2]))
  1.533 ns (0 allocations: 0 bytes)

Thus, without the parentheses in the second example, the benchmarked call is lowered to something similar to the definition of g above. With the parentheses, the interpolation happens on the result of getindex, and the call is then type-stable. That is to say, you won’t get much information about this type instability from f itself, because there is none there; the instability manifests in the context where f is called.


OK, this was a bottom-up attempt to fix the performance issues I’m seeing: dive into the code to see where allocations happen, and fix them. I’ve decided to also take a top-down approach: describe the type of program I’m trying to write, and see what architectural approaches have already been used in this case. If you’re interested, you can read about it here.

Indeed, in this case the problem boils down to having a vector typed with an abstract type. In some situations that can be avoided by using a tuple (if the vector is small). In other cases it is inherently difficult to solve (operating on containers of abstract types is probably the hardest thing to optimize), and it sometimes requires rethinking the structure of the problem to avoid those containers altogether.

One of these threads can lead you to long discussions about this last kind of problem: Performance drawback with subtyping

(there is no easy and general solution)
