Performance problems when dealing with deliberately type-unstable code. How to use type knowledge better?

I have a problem that crops up again and again, which relates to the use of different concrete types in situations where I can’t possibly establish type stability. Here’s a contrived example: There are three different subtypes, and I have a vector of two of them, which is parameterized with the abstract type. Therefore, the compiler doesn’t know the return type of the function get_value. But I as the programmer do know that it will always be Float64, because I have set up the logic that way. How can I capitalize on that knowledge?

In the second example I’ve tried to supply the compiler with the knowledge that there should only be Float64s coming out of that particular get_value call. The result is that the allocations are halved, but the computation time is doubled.

Please don’t suggest changes to this example that get away from the question, I purposefully set it up not to use union-splitting, etc., because that’s the situation I face most often. What can I do in a situation with unknown input types but known output types?

abstract type Abstract end

struct ConcreteA <: Abstract
    a::Float64
end

struct ConcreteB <: Abstract
    b::Float64
end

struct ConcreteC <: Abstract
    c
end

# get_value(::Abstract) is inferred Any because of ConcreteC
get_value(c::ConcreteA) = c.a
get_value(c::ConcreteB) = c.b
get_value(c::ConcreteC) = c.c

# make a vector of only ConcreteA and B
concretes_ab = [rand() < 0.5 ? ConcreteA(rand()) : ConcreteB(rand()) for _ in 1:10000]

# a function that sums the values, without asserting Float64
function sum_concretes_1(cs)
    x = 0.0
    for c in cs
        val = get_value(c)
        x += val
    end
    x
end

# and with Float64
function sum_concretes_2(cs)
    x = 0.0
    for c in cs
        val::Float64 = get_value(c)
        x += val
    end
    x
end

What explains the time / allocation pattern here? Why is specifying Float64 not beneficial? Can I somehow avoid allocations altogether even while keeping c’s type unknown to the compiler?

julia> @time sum_concretes_1(concretes_ab)
  0.000391 seconds (20.00 k allocations: 312.500 KiB)
5055.997450986205

julia> @time sum_concretes_2(concretes_ab)
  0.001492 seconds (10.00 k allocations: 156.266 KiB)
5055.997450986205
1 Like

Therefore, the compiler doesn’t know the return type of the function get_value . But I as the programmer do know that it will always be Float64 , because I have set up the logic that way. How can I capitalize on that knowledge?

You definitely can but this requires slight changes to precisely let Julia know of your knowledge. For instance:

abstract type Abstract end

abstract type SubAbstract <: Abstract end

struct ConcreteA <: SubAbstract
    a::Float64
end

struct ConcreteB <: SubAbstract
    b::Float64
end

struct ConcreteC <: Abstract
    c
end

# get_value(::Abstract) is inferred Any because of ConcreteC
get_value(c::ConcreteA) = c.a
get_value(c::ConcreteB) = c.b
get_value(c::ConcreteC) = c.c

# make a vector of only ConcreteA and B
concretes_ab = [rand() < 0.5 ? ConcreteA(rand()) : ConcreteB(rand()) for _ in 1:10000]

# a function that sums the values, without asserting Float64
function _sum_(cs)
    x = 0.0
    for c in cs
        val = get_value(c)
        x += val
    end
    x
end

Now, concretes_ab is a Array{SubAbstract,1} and Julia is capable of figuring out that get_value on SubAbstract returns a Float64.

julia> @btime _sum_($concretes_ab);
  55.210 μs (0 allocations: 0 bytes)

Why is specifying Float64 not beneficial?

Essentially adding Float64 helps for the + function (this is why you have halved the allocations) but the compiler still needs to figure out which get_value method to call.

Since you said that union splitting shouldn’t be used here (why?) the only thing that comes to mind is

function sum_concretes_3(cs)
    x = 0.0
    for c in cs
        val = get_value(c)::Float64
        x += val
    end
    x
end

It halves number of allocations and execution time. You can see the difference with the help of @code_warntype macro

@code_warntype sum_concretes_1(concretes_ab)
@code_warntype sum_concretes_2(concretes_ab)
@code_warntype sum_concretes_3(concretes_ab)

In case of sum_concretes_2 get_value returns value of the type Any and after that it is converted to Float64. In sum_concretes_3 get_value you hint compiler that return value of the function has type Float64.

3 Likes

It works only for small number of subtypes though. On 1.7-DEV

abstract type Abstract end

abstract type SubAbstract <: Abstract end

struct ConcreteA1 <: SubAbstract
    a::Float64
end

struct ConcreteA2 <: SubAbstract
    b::Float64
end

struct ConcreteA3 <: SubAbstract
    b::Float64
end

struct ConcreteA4 <: SubAbstract
    b::Float64
end

struct ConcreteA5 <: SubAbstract
    b::Float64
end

struct ConcreteC <: Abstract
    c
end

# get_value(::Abstract) is inferred Any because of ConcreteC
get_value(c::ConcreteA1) = c.a
get_value(c::ConcreteA2) = c.b
get_value(c::ConcreteA3) = c.b
get_value(c::ConcreteA4) = c.b
get_value(c::ConcreteA5) = c.b
get_value(c::ConcreteC) = c.c

# make a vector of only ConcreteA and B
concretes_ab = [rand() < 0.5 ? ConcreteA1(rand()) : ConcreteA2(rand()) for _ in 1:10000]

# a function that sums the values, without asserting Float64
function _sum_(cs)
    x = 0.0
    for c in cs
        val = get_value(c)
        x += val
    end
    x
end

using BenchmarkTools
@btime _sum_($concretes_ab)

#  382.936 μs (20000 allocations: 312.50 KiB)

When number of methods reaches some critical number compiler gives up and wrap everything in Any.

1 Like

I don’t think I have much to say here, but declaring the vector like this turns out to solve the problem of allocations in this case and is also 5 times faster than the SubAbstract option. I don’t know if that falls into the invalid options:

# make a vector of only ConcreteA and B
concretes_ab2 = Union{ConcreteA,ConcreteB}[rand() < 0.5 ? ConcreteA(rand()) : ConcreteB(rand()) for _ in 1:10000]

Edit: I guess this will also not work for vectors with more subtypes, as mentioned above, because the compiler is probably doing splitting and it will give up.

1 Like

Interesting with the ::Float64 assertion on the right hand side vs left hand side, I didn’t know that it made a difference for performance. Maybe that’s the best one can do in these situations but I still would like to understand better why.

To the other suggestions with union splitting / subtyping differently / parameterizing more tightly: I’m aware of these things but as mentioned already, often these approaches are not feasible due to a large number of types, or because you would need multiple different subtype hierarchies for different applications. I’m trying to understand how to make the most out of the type unstable situation assuming that I can’t reduce the degree of type instability (even though in the example you technically could)

1 Like

Also, if I understand correctly, the allocations come from boxing the return values. But can’t I go in and say, don’t box, these return values are always going to be floats (or you can error)? If that’s not technically possible, why not?

Just for reference, if you can subtype all your structs from a single abstract type, then you can use Macro to write function with many conditionals - #8 by Skoffer to make union splitting irrelevant of the number of types you define.

1 Like

Only if all these subtypes exist at first invocation, no?

I suppose yes. Maybe it is possible to use IRTools.jl or Cassette.jl to dynamically adjust code, but it is more complicated of course.

Maybe https://github.com/tkoolen/TypeSortedCollections.jl could be useful.

1 Like

Do you think that could be converted into a macro to be added as a flag to the for loop (as @threads), that splits the loop on all subtypes of the collection being iterated? I think that would be very useful. I know that one should avoid dispatching on these mixed collections, but that is quite a common problem and the simplicity of the code it allows is sometimes important. Also, it is not a marginal gain. In the example here, with 5 subtypes, the speedup can go from 2 times to 20 times depending on the type-specificity of the fields of the concrete types involved.

It could have a parameter for defining the maximum number of subtypes, such that it does not get used inappropriately for abstract types with too many subtypes.

I was imagining something like

@splitloop AbstractType for c in cs
   ...
end

or

@splitloop Union{C1,C2,C3,C4} for c in cs
   ...
end

that splits the loop in

for c in cs
  if c isa C1
    ....
  elseif c isa C2
     ....
  etc.
end

this is the correct solution, OP’s original version 2 is doing conversion, instead do this:

julia> # and with Float64
       function sum_concretes_3(cs)
           x = 0.0
           for c in cs
               x += get_value(c)::Float64
           end
           x
       end
julia> @btime sum_concretes_1($concretes_ab)
  240.867 μs (20000 allocations: 312.50 KiB)
4987.173540882938

julia> @btime sum_concretes_2($concretes_ab)
  559.892 μs (10000 allocations: 156.25 KiB)
4987.173540882938

julia> @btime sum_concretes_3($concretes_ab)
  71.847 μs (10000 allocations: 156.25 KiB)
4987.173540882938

What I still don’t understand is the need for boxing at all if I instruct the compiler that there will always be a Float64 at this location. Can the behavior of checking if the value is a Float64 not be done without allocations? Or couldn’t the box be reused in every loop, effectively needing only 1 allocation?

1 Like

The problem is that the same compiled loop body must work for all iterations, and the compiler doesn’t know that c values are always ConcreteA or ConcreteB, so the loop body has to work with get_value(::ConcreteC) which returns a boxed value.

It helps to write get_value(c)::Float64 to get a more efficient loop body, but this is just annotating an expression, you could also write it (get_value(c))::Float64, it doesn’t change how get_value is compiled.

So for the loop body, the compiler knows that get_value(c) will be a Float64, but that doesn’t say anything about c. It must still assume that c could be a ConcreteC, meaning that get_value could return a boxed value.

To fix this, you can use an array with a more precise type (the subtype solution proposed above, or a union type). Then it’s clear the loop only calls get_value(::ConcreteA) and get_value(::ConcreteB) so no unboxing is necessary.

Another solution is to remove boxing from the compilation of get_value(::ConcreteC). This requires annotating the definition of get_value itself:

get_value(c::ConcreteC)::Float64 = c.c

This gives very fast code with zero allocation.

4 Likes

Why isn’t it possible to say “I know this expression can’t be proven to return only a Float64, but I personally guarantee that it will and I’ll accept an error if that condition is not met”. Then c could maybe be returned without box as long as it’s really a Float64. I guess I would even be fine with undefined behavior in case the assumption wasn’t met, I see this as a last-resort optimization that could help, like @inbounds allows segmentation faults when I pass an invalid index, there could be a @nobox macro.

1 Like

That’s exactly what get_value(c)::Float64 means. It will throw an exception if the value returned by get_value(c) is not a (possibly boxed) Float64. This doesn’t change the fact that get_value(::ConcreteC) returns a boxed value. Functions are specialized based on the input types, not based on annotations of result expressions in the calling code.

2 Likes

True, but I never call get_value(::ConcreteC), in the Vector there is only A and B. That’s kind of the point, I know in this code there will only be Float64s returned, but they are boxed because the calling code can’t infer all possible return values due to the untyped field of ConcreteC. So that’s where a @nobox macro could potentially come in.

One scenario for this came up the other day where somebody tried to write a ray tracer and had a lot of allocations due to unpredictable types (lights, materials, objects). But what you can predict is what the output types of your functions are going to be in all cases. Maybe hits, or rgb colors, or something else. And the question is, can’t we elide the unnecessary boxing that the compiler uses to verify the output types of uninferrable calls if we give a manual guarantee

Here’s a slightly different example that gets closer to the problem. There are too many structs to infer get_value correctly, even though all structs have a concretely typed field x::Float64. It’s impossible to completely capitalize on that it seems.

abstract type Abstract end

# make a couple of concrete subtypes, here they all share the same field x::Float64, but
# there are too many of them so the compiler won't infer `get_value`
for x in 'A':'Z'
    name = Symbol("Concrete$x")
    @eval struct $name <: Abstract
        x::Float64
    end
end

get_value(a::Abstract) = a.x

cs = repeat([ConcreteA(1), ConcreteB(1), ConcreteC(1), ConcreteD(1), ConcreteE(1)], 10000)


function test(cs)
    x = 0.0
    for c in cs
        val = get_value(c)
        x += val
    end
    x
end

function test_2(cs)
    x = 0.0
    for c in cs
        val = get_value(c)::Float64
        x += val
    end
    x
end

This gives the following results, similar to the above example:

julia> @time test(cs)
  0.003093 seconds (100.00 k allocations: 1.526 MiB)
50000.0

julia> @time test_2(cs)
  0.001635 seconds (50.00 k allocations: 781.266 KiB)
50000.0
1 Like

OK that’s a bit different. Note that even with two types ('A':'B') this code doesn’t infer. It does infer (with two types) when defining concrete methods get_value(a::$name) = a.x.

I don’t see a solution that avoids union splitting. Something like @nonboxed Float64 ... sounds useful though a bit weird, and it could be problematic since whether a particular method returns a boxed value can change as the compiler evolves…