Understanding specialized methods for field access

Hi!

Since this is my first post, let me say that I have used Julia for a number of work and hobby projects so far, and that I really like the language and its possibilities – and its amazing community, which shares its insights and provides resources to learn from. Still, I occasionally stumble upon things I would like to understand better, mainly to improve the design/performance of my code, but also to learn more about Julia’s internals and programming as a whole. I hope I’ve come to the right place with this kind of question.


In this (somewhat contrived) example code, I’m not sure why the performance differs significantly between the two implementations of an accessor function (one with a specialized method for the concrete type and one without):

using BenchmarkTools

abstract type AbstractElement end

struct ConcreteElement <: AbstractElement
    D::Float64
end

struct UnstableStruct
    v::Vector{AbstractElement}
end

get_D_generic(el)::Float64 = el.D                # generic accessor with a return-type annotation
get_D_specialized(el)::Float64 = el.D            # generic fallback with the same annotation
get_D_specialized(el::ConcreteElement) = el.D    # extra method specialized on the concrete type

function copy_values_to_vector1(s::UnstableStruct)
    result = zeros(length(s.v))
    for i in eachindex(s.v)
        result[i] = get_D_generic(s.v[i])
    end
    return result
end

function copy_values_to_vector2(s::UnstableStruct)
    result = zeros(length(s.v))
    for i in eachindex(s.v)
        result[i] = get_D_specialized(s.v[i])
    end
    return result
end

function main()
    elements = ConcreteElement.(rand(1000))
    s = UnstableStruct(elements)

    result1 = @btime copy_values_to_vector1($s)
    result2 = @btime copy_values_to_vector2($s)

    @assert result1 == result2
end

main()

which produces on my machine

  12.914 μs (1001 allocations: 23.56 KiB)
  1.462 μs (1 allocation: 7.94 KiB)

As far as I could gather from the docs, searching the forums, and looking at the @code_xx macros, this is what’s going on:

  • as soon as a function is called within the code, a specialized method will be compiled for the concrete types of the arguments, so the generated code for get_D_generic(::ConcreteElement) and get_D_specialized(::ConcreteElement) should be exactly the same
  • inside the loop of copy_values_to_vector, the type of each element of the vector is not known, but the output type of calling get_D_... is known to be Float64
  • if I remove the output type annotation of get_D_specialized(el), the performance of both versions is the same – if I remove the “generic” version of get_D_specialized altogether (i.e. only keep get_D_specialized(::ConcreteElement)), the difference reappears

What I thought was happening is that the code using get_D_generic(el) has to do some additional type check and/or conversion because it cannot know for sure that the field D will actually contain a float.
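If I understand the docs correctly, the return-type annotation roughly amounts to wrapping the returned value in a convert call, something like this (my own sketch, not the actual lowered code):

# roughly what get_D_generic(el)::Float64 = el.D expands to
get_D_generic(el) = convert(Float64, el.D)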

But my main point of confusion is: why does it take an additional, manually defined method for ConcreteElement with identical code? The presence of the generic “catch-all” method get_D_specialized(el)::Float64 apparently does not prevent the right (manually defined) method from being picked in each loop iteration, but with only get_D_generic(el)::Float64 the loop function does not generate and use a specialized version of it?

My current approach (if I want to keep the abstract-type container in the struct) would be to define the methods for each individual concrete type, but using a single generic method would seem like a very convenient and sensible thing to do if all implementations have the same field D.

I can imagine that variations of this issue have been discussed before and I just couldn’t find/understand it, so I would also be happy about a link in the right direction :slight_smile:


I think what you are implicitly doing is adding a return type to the getfield operation, and thereby allowing the compiler to specialize the sum, which is otherwise performed without that information. Similar to this:

julia> using BenchmarkTools

julia> struct A
           x::Vector{Real}
       end

julia> function sum(a::A)
           s = 0.0
           for i in eachindex(a.x)
               s += a.x[i]
           end
           return s
       end
sum (generic function with 1 method)

julia> a = A(rand(1000));

julia> @btime sum($a)
  9.915 μs (1000 allocations: 15.62 KiB)
479.45696929415135

julia> function sum_stable(a::A)
           s = 0.0
           for i in eachindex(a.x)
               s += a.x[i]::Float64 # informing the return type to the compiler
           end
           return s
       end
sum_stable (generic function with 1 method)

julia> @btime sum_stable($a)
  857.175 ns (0 allocations: 0 bytes)
479.45696929415135

In the same sense as in your post, here we need two different implementations to obtain type stability for the sum, which is caused by having the vector of abstract types in your struct. This kind of thing is ideally solved by having a vector of concrete types, if possible. If the types are intrinsically variable, but you happen to know that the output of some specific operation has a type that cannot be inferred, you need such an annotation.

These things happen, but they are not as common as they seem. In general one would parameterize the structure so as to avoid the abstract container.
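For example, something along these lines (an untested sketch; StableStruct is just a name made up for illustration):

struct StableStruct{T<:AbstractElement}
    v::Vector{T}
end

s = StableStruct(ConcreteElement.(rand(1000)))  # StableStruct{ConcreteElement}, concretely typed

With the element type of v known at compile time, no annotations or extra methods are needed in the loop.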


Thanks for your answer and the example! I see your point (and using parametric types or other solutions to get rid of this are definitely an option), but I’d still like to point out two things:

  • after reading this section of the docs again, I would have thought that your annotation a.x[i]::Float64 is what they refer to as “type assertion” which checks if the object is of the specified type and errors otherwise (but it obviously also helps the compiler in your example which is good to know!)
    → The difference would be that your code would break if A also contained, say, integers as well as floats
  • in my snippet, the compiler should also have full knowledge of the return type of the get_D_... functions, since it is fixed in the method definitions. But in addition to just asserting the type, it will also convert the value if necessary
    → so putting different concrete implementations of AbstractElement into the vector does still work, but it is slower and creates allocations unless I add a corresponding method get_D(::ConcreteElement2) as well

The output from @code_warntype is, as far as I can tell, identical for both loop functions in my original snippet. And since the type of the innermost object I want (the field D) is, in principle, known in both scenarios, I am just wondering why an explicit method for the concrete type is needed, even though the (Julia) code is identical and the type information is also available in the loop.
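For reference, I compared them simply like this, after constructing s as in main():

@code_warntype copy_values_to_vector1(s)
@code_warntype copy_values_to_vector2(s)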


I think these things are not exactly correct, and they are related. Adding the additional method won’t help in the general case: at each iteration of the loop there is still the need to decide, at run time, which method to call.

The case where, inside the loop, you only call the specific method with a known output type is the one where the compiler can optimize the whole loop.

And there can be intermediate situations, where the compiler figures out that there is a small number of possibilities and performs union splitting automatically. As one adds more types to the container, it becomes less likely that the loop can be optimized, and at some point one can improve things by guaranteeing the output type of the operation with a type assertion, if that fits the logic.


Ah, that’s interesting, so the lookup of the most specific method will at some point lead to an overhead? (I was aware that this could happen, but I thought it would be smaller, even for such a simple loop operation as in my example.)


I tried to expand my example to better understand what you wrote:

When I add a second, identical ConcreteElement2 and randomly populate the struct with the two different concrete types, running main() gives:

Two concrete subtypes
  20.482 μs (1001 allocations: 23.56 KiB) # only generic method
  8.791 μs (497 allocations: 15.69 KiB)   # specific method only for `ConcreteElement1`

Where 497 elements were of the type without the specific method.
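For reference, the additions look roughly like this (just a sketch of what I ran, using the names from my first snippet):

struct ConcreteElement2 <: AbstractElement
    D::Float64
end

# the specific method is still defined only for the original ConcreteElement
elements = [rand(Bool) ? ConcreteElement(rand()) : ConcreteElement2(rand()) for _ in 1:1000]
s = UnstableStruct(elements)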

And when I add the more specific implementation of the get_D_... for ConcreteElement2 it seems to help:

  19.388 μs (1001 allocations: 23.56 KiB) # only generic method
  1.948 μs (1 allocation: 7.94 KiB)       # specific lookups for both `ConcreteElement1` and `2`

So I thought that adding the specific methods is the way to go. But, as you mentioned, when there are more subtypes, it actually gets worse when implementing the specific lookup:

Three concrete subtypes
  17.590 μs (1001 allocations: 23.56 KiB)
  5.836 μs (304 allocations: 12.67 KiB)   # Now 2 / 3 concrete types have a lookup, the third one uses the fallback

but with the third identical method for ConcreteElement3, it turns into

  18.403 μs (1001 allocations: 23.56 KiB)
  39.320 μs (1490 allocations: 31.20 KiB) # All three concrete types with specific lookup method.

I still don’t get where the extra allocations are coming from in that last case, but that’s probably related to the union splitting you mentioned. The overhead in allocations does seem to stay constant as the number of concrete subtypes increases (at least up to 10 different subtypes), though.

When I also add a type annotation inside the loop, similar to your example, both functions have the same performance (1 allocation per execution and roughly the same runtime). It doesn’t work when I only annotate the output type of the function in every method definition.
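I.e. something like this (a sketch based on my original snippet; the function name is just for illustration):

function copy_values_to_vector_asserted(s::UnstableStruct)
    result = zeros(length(s.v))
    for i in eachindex(s.v)
        # assert the element type at the call site, as in the sum_stable example
        result[i] = get_D_generic(s.v[i])::Float64
    end
    return result
end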

Although I still feel like I have a lot to learn about how the Julia compiler works :sweat_smile: I think I am getting a better intuition for why dynamic dispatch in a performance-critical loop can sometimes be worse than expected.

Thanks a lot for your explanations @lmiq !


The overhead is large if the compiler cannot infer the output type of the operation. Then it has to allocate a place in heap memory to store the result of the operation; that is what is called “boxing” the result. The allocations in your example are caused by this boxing, and it has to happen for each operation in the loop where the compiler cannot know the output type in advance. Once that happens, it does not depend on the number of subtypes.

Union splitting is the compiler transforming your loop into something like:

for obj in objects
    if obj isa Type1
        f(obj) # specific method for Type1, dispatched statically
    elseif obj isa Type2
        f(obj) # specific method for Type2, dispatched statically
    # ...
    end
end

The overhead of doing that is small, and this is the good case. But the compiler only does that if it knows that the loop has a limited number of possibilities. At some point it gives up and run-time dispatch is used. Then, the result is boxed (an allocation occurs), and things get much slower.

One can sometimes get this behavior by declaring the vector of mixed types as Vector{Union{Type1,Type2}} instead of Vector{AbstractObject}.
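For the example in this thread that would be something like this (an untested sketch; UnionStruct is just an illustrative name):

struct UnionStruct
    v::Vector{Union{ConcreteElement, ConcreteElement2}}
end

With a small Union like this, the compiler can still apply the union splitting shown above even though the element type of v is not concrete.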


I get why this leads to what you describe (boxing / heap allocations), but I guess I don’t understand why it makes a difference if I

  • define all methods of the function as having output type Float64 → I get 1490 allocations when looping over the vector with 1000 randomly chosen concrete elements
  • or put the type annotation of Float64 inside the loop → I get only 1000 allocations for the same vector

Conceptually, both situations feel the same to me, so it is not obvious to me why the information is not available to the compiler in the first case – is it not possible (or too expensive) to check the return types of all methods when the loop code is compiled?

I think this does not matter once the compiler has thrown in the towel on the number of types (also, annotating the output rarely helps: most of the time the output type can be inferred from the input, so such annotations are usually redundant). The annotation inside the loop, on the other hand, is a guarantee to the compiler about that specific piece of code.
