I previously posted a question about dynamic dispatch leading to allocations. The answer I got was that the type instability that leads to dynamic dispatch also leads to the allocations - at least that’s what I understood. This was helpful, but I still don’t see why type instability needs to lead to allocations in certain situations.

One such situation is this:

v1 = Vector(undef, 4) # Vector{Any}
v1[1] = randn(Float64, 5)
v1[2] = randn(Float32, 7)
v1[3] = rand(UInt8, 3)
v1[4] = rand(Int16, 2)
v2 = [v1...] # Vector{Vector{Float64}}
function mysumv(V)::Float64
    return sum(V)
end

function mysum(V)
    t = 0.0
    for v in V
        t += mysumv(v)  # This always returns Float64
    end
    return t
end
julia> @btime mysum($v1)
  175.546 ns (4 allocations: 64 bytes)

julia> @btime mysum($v2)
  12.638 ns (0 allocations: 0 bytes)

We can see that mysum(v1) is type unstable because the input vector has element type Any, whereas mysum(v2) is type stable because the input vector has a concrete element type (Vector{Float64}). We can also see that this leads to allocations and 10x slower code in the former case.
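To make that concrete, the element types can be inspected directly (a small sketch rebuilding v1 and v2 as above; `isconcretetype` is just used here to illustrate the claim):

```julia
v1 = Vector{Any}(undef, 4)
v1[1] = randn(Float64, 5)
v1[2] = randn(Float32, 7)
v1[3] = rand(UInt8, 3)
v1[4] = rand(Int16, 2)
v2 = [v1...]  # array construction promotes the elements to Vector{Float64}

# eltype(v1) is the abstract type Any: each v1[i] is a boxed value whose type
# is only known at runtime, so calls on it must be dispatched dynamically.
# eltype(v2) is concrete, so one specialized method can be compiled up front.
eltype(v1), isconcretetype(eltype(v1))  # (Any, false)
eltype(v2), isconcretetype(eltype(v2))  # (Vector{Float64}, true)
```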

In order to understand why allocations might be necessary, I think about how I’d write this in C++. I imagine a Vector{Any} being a vector of pointers to object headers that describe the objects, including their type. I imagine that dynamic dispatch would require a special dynamic dispatch function to be called that, given the pointer to the object header, looks up (and optionally compiles) the actual function to be called, and returns the function pointer. This function would then be called, with the object header pointer from the vector passed straight in. The return type is always known to be Float64, so there’s no type instability there. None of these operations requires a heap allocation. So I don’t really understand why Julia requires them.

Can anyone provide an explanation as to why the allocations are necessary here?

Having a way to make some type-unstable code fast would be great, because in situations where type instability cannot be avoided, the code might at least be written so that it still runs efficiently.

First let’s look at the inferred code for mysum(v1) (Julia 1.8.5):

julia> @code_warntype mysum(v1)
MethodInstance for mysum(::Vector{Any})
  from mysum(V) in Main at REPL[13]:1
Arguments
  #self#::Core.Const(mysum)
  V::Vector{Any}
Locals
  @_3::Union{Nothing, Tuple{Any, Int64}}
  t::Float64
  v::Any
Body::Float64
1 ─       (t = 0.0)
│   %2  = V::Vector{Any}
│         (@_3 = Base.iterate(%2))
│   %4  = (@_3 === nothing)::Bool
│   %5  = Base.not_int(%4)::Bool
└──       goto #4 if not %5
2 ┄ %7  = @_3::Tuple{Any, Int64}
│         (v = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│   %10 = t::Float64
│   %11 = Main.mysumv(v)::Float64
│         (t = %10 + %11)
│         (@_3 = Base.iterate(%2, %9))
│   %14 = (@_3 === nothing)::Bool
│   %15 = Base.not_int(%14)::Bool
└──       goto #4 if not %15
3 ─       goto #2
4 ┄       return t

As expected, the type of v is inferred as Any and therefore some dynamic dispatch occurs to find a matching method for mysumv(v).

In your code, you ensured the return type of mysumv(v) could be inferred (as Float64), but first let's suppose it could not be inferred (i.e. it is inferred as Any). In that case, Julia would have to compile all the methods of mysumv to return boxed Julia objects, which requires heap allocation.

My guess is that, even though the return type of mysumv(v) is inferred (as Float64), the dynamically dispatched call still boxes its return value, and therefore still allocates. In principle, Julia could compile versions of these methods to return a Float64 natively, but I guess that in the face of dynamic dispatch, Julia currently always boxes the result. Presumably this is an optimization that could be implemented.
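One way to see this boxing without BenchmarkTools is `@allocated` (a sketch; `boxed_sum` and `loop_boxed` are hypothetical stand-ins for mysumv and mysum):

```julia
boxed_sum(v) = sum(v)::Float64  # return type asserted, like mysumv

function loop_boxed(V)
    t = 0.0
    for v in V
        # v is inferred as Any, so this call is dispatched dynamically and,
        # on current Julia versions, its Float64 result comes back boxed.
        t += boxed_sum(v)
    end
    return t
end

V = Any[randn(4), randn(5), randn(6)]
loop_boxed(V)             # warm up so compilation isn't counted
@allocated loop_boxed(V)  # nonzero: one small allocation per boxed result
```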

To test this hypothesis, we can avoid the boxing by having the inner function write its result into a Ref and return nothing. This makes mysum(v1) a little faster.

julia> function mysumv!(r, v)
           r[] = sum(v)
           return
       end
mysumv! (generic function with 1 method)

julia> function mysum2(V)
           t = 0.0
           r = Ref(0.0)
           for v in V
               mysumv!(r, v)
               t += r[]
           end
           return t
       end
mysum2 (generic function with 1 method)

julia> @btime mysum($v1)
  91.604 ns (4 allocations: 64 bytes)
28621.18967872079

julia> @btime mysum2($v1)
  81.828 ns (1 allocation: 16 bytes)
28621.18967872079

Note that this still has one allocation. This is probably the allocation of the Ref. Often Julia can avoid this allocation via escape analysis, but presumably the dynamic dispatch gets in the way of that.
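For comparison, the same Ref pattern in fully type-stable code typically allocates nothing, which supports the idea that it is the dynamic dispatch that defeats the escape analysis (a sketch; `stable_ref_sum` is a hypothetical type-stable counterpart):

```julia
function stable_ref_sum(v::Vector{Float64})
    r = Ref(0.0)  # never escapes the function
    for x in v
        r[] += x
    end
    return r[]
end

v = randn(100)
stable_ref_sum(v)             # warm up so compilation isn't counted
@allocated stable_ref_sum(v)  # typically 0: the compiler elides the Ref
```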

If the user supplies the Ref, then we can bring allocations down to zero, but with no appreciable speed-up (mysum3 on v1 is still about 4x slower than mysum(v2) for me):

julia> function mysum3(r, V)
           t = 0.0
           for v in V
               mysumv!(r, v)
               t += r[]
           end
           return t
       end
mysum3 (generic function with 1 method)

julia> @btime mysum3($(Ref(0.0)), $v1)
  78.431 ns (0 allocations: 0 bytes)
28621.18967872079

To summarise: boxing the intermediate results appears to take ~10 ns, the dynamic dispatch ~60 ns, and the actual work ~20 ns. Hence the allocations coming from boxing have only a small impact on run-time here.

Edit: @btime returns the minimum time, which can be deceiving when timing allocating code, since the GC is non-deterministic. Using @benchmark instead and looking at the mean time, the boxing does indeed take a larger portion of the time, but still less than dynamic dispatch.
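For reference, a sketch of how to look at the mean rather than the minimum (assuming BenchmarkTools is installed; `f` is a stand-in for mysum):

```julia
using BenchmarkTools, Statistics

f(V) = sum(v -> sum(v)::Float64, V)  # stand-in for mysum
V = Any[randn(4) for _ in 1:4]

trial = @benchmark f($V)  # returns a Trial holding all sample times
minimum(trial.times)      # what @btime reports (in nanoseconds)
mean(trial.times)         # includes occasional GC pauses
```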

Interesting. I felt confident that specifying the return type of mysumv() would remove this boxing. If this is the source of the allocations then, as you say, it's likely something that can be fixed. Can you tell where the boxing occurs from the output of @code_warntype? I assumed it happened on the red lines of code, but the only red line in the for loop is at %7, which doesn't relate to the return value of mysumv().

If this is the case, then based on my previous tests, I feel the time taken for the dynamic dispatch itself could also be significantly improved.