# Why does this type instability lead to allocations?

I previously posted a question about dynamic dispatch leading to allocations. The answer I got was that the type instability that leads to dynamic dispatch also leads to the allocations - at least that’s what I understood. This was helpful, but I still don’t see why type instability needs to lead to allocations in certain situations.

One such situation is this:

```julia
using BenchmarkTools

v1 = Vector(undef, 4)           # Vector{Any}
v1[1] = randn(Float64, 5)
v1[2] = randn(Float32, 7)
v1[3] = rand(UInt8, 3)
v1[4] = rand(Int16, 2)
v2 = [v1...]                    # Vector{Vector{Float64}}

function mysumv(V)::Float64
    return sum(V)
end

function mysum(V)
    t = 0.0
    for v in V
        t += mysumv(v)   # This always returns Float64
    end
    return t
end

julia> @btime mysum($v1)
  175.546 ns (4 allocations: 64 bytes)

julia> @btime mysum($v2)
  12.638 ns (0 allocations: 0 bytes)
```

We can see that `mysum(v1)` is type unstable because the input vector has element type `Any`, whereas `mysum(v2)` is type stable: splatting with `[v1...]` promotes the mixed element types and converts every element to `Vector{Float64}`, so the input vector has a concrete element type. We can also see that this leads to allocations and ~14x slower code in the former case.

In order to understand why allocations might be necessary, I think about how I’d write this in C++. I imagine a `Vector{Any}` being a vector of pointers to object headers that describe the objects, including their type. I imagine that dynamic dispatch would require a special dynamic dispatch function to be called that, given the pointer to the object header, looks up (and optionally compiles) the actual function to be called, and returns the function pointer. This function would then be called, with the object header pointer from the vector passed straight in. The return type is always known to be Float64, so there’s no type instability there. None of these operations requires a heap allocation. So I don’t really understand why Julia requires them.

Can anyone provide an explanation as to why the allocations are necessary here?

Having a way to make some type-unstable code fast would be great, because in situations where type instability cannot be avoided, the code could at least be written to run efficiently.


First let’s look at the inferred code for `mysum(v1)` (Julia 1.8.5):

```julia
julia> @code_warntype mysum(v1)
MethodInstance for mysum(::Vector{Any})
  from mysum(V) in Main at REPL[13]:1
Arguments
  #self#::Core.Const(mysum)
  V::Vector{Any}
Locals
  @_3::Union{Nothing, Tuple{Any, Int64}}
  t::Float64
  v::Any
Body::Float64
1 ─       (t = 0.0)
│   %2  = V::Vector{Any}
│         (@_3 = Base.iterate(%2))
│   %4  = (@_3 === nothing)::Bool
│   %5  = Base.not_int(%4)::Bool
└──       goto #4 if not %5
2 ┄ %7  = @_3::Tuple{Any, Int64}
│         (v = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│   %10 = t::Float64
│   %11 = Main.mysumv(v)::Float64
│         (t = %10 + %11)
│         (@_3 = Base.iterate(%2, %9))
│   %14 = (@_3 === nothing)::Bool
│   %15 = Base.not_int(%14)::Bool
└──       goto #4 if not %15
3 ─       goto #2
4 ┄       return t
```

As expected, the type of `v` is inferred as `Any` and therefore some dynamic dispatch occurs to find a matching method for `mysumv(v)`.

In your code, you ensured the return type of `mysumv(v)` could be inferred (as `Float64`), but first let's suppose it could not be inferred (i.e. it is inferred as `Any`). In this case, Julia would have to compile all the methods of `mysumv` to return boxed Julia objects, which requires heap allocation.

My guess is that, even though the return type of `mysumv(v)` is inferred (as `Float64`), it is still boxing the return value, and therefore still allocating. In principle, Julia could compile versions of these methods to return `Float64` natively, but I guess that in the face of dynamic dispatch, Julia currently always boxes the result. Presumably this is an optimization that could be implemented.


To confirm my hypothesis, we can avoid the boxing by saving the result from `mysumv` into a `Ref` and having it return `nothing`. This makes `mysum(v1)` a little faster:

```julia
julia> function mysumv!(r, v)
           r[] = sum(v)
           return
       end
mysumv! (generic function with 1 method)

julia> function mysum2(V)
           t = 0.0
           r = Ref(0.0)
           for v in V
               mysumv!(r, v)
               t += r[]
           end
           return t
       end
mysum2 (generic function with 1 method)

julia> @btime mysum($v1)
  91.604 ns (4 allocations: 64 bytes)
28621.18967872079

julia> @btime mysum2($v1)
  81.828 ns (1 allocation: 16 bytes)
28621.18967872079
```

Note that this still has one allocation. This is probably the allocation of the `Ref`. Often Julia can avoid this allocation via escape analysis, but presumably the dynamic dispatch gets in the way of that.

If the user supplies the `Ref` then we can bring allocations down to zero but with no appreciable speed-up in `mysum(v1)` (it’s still 4x slower than `mysum(v2)` for me):

```julia
julia> function mysum3(r, V)
           t = 0.0
           for v in V
               mysumv!(r, v)
               t += r[]
           end
           return t
       end
mysum3 (generic function with 1 method)

julia> @btime mysum3($(Ref(0.0)), $v1)
  78.431 ns (0 allocations: 0 bytes)
28621.18967872079
```

To summarise, it appears to take ~10 ns to box the intermediate results, ~60 ns to do the dynamic dispatch, and ~20 ns to do the actual work. Hence the allocations from boxing have a fairly small impact on run-time here.

Edit: `@btime` reports the minimum time, which can be misleading when timing allocating code, since the GC is non-deterministic. Using `@benchmark` instead and looking at the mean time, the boxing does take a larger share of the run-time, but still less than dynamic dispatch.


Many thanks for your input, @cjdoris.

Interesting. I felt confident that specifying the return type of `mysumv` would remove this boxing. If this is the source of the allocations then, as you say, it's likely something that can be fixed. Can you tell where the boxing occurs from the output of `@code_warntype`? I assumed it happened on the red lines, but the only red line in the loop is at `%7`, which doesn't relate to the output of `mysumv`.

If this is the case, then based on my earlier tests, I feel the time taken for dynamic dispatch is also something that could be significantly improved.