The usual explanation for why runtime dispatches allocate is that falling back to inferring ::Any
for the return type requires boxing the instance at runtime. The thing is, there seems to be a workaround. If the compiler knows that there are only a few methods (>=1, defaults to 3) a call can dispatch to, it may infer all of them to potentially improve the return type inference. I don’t know exactly how and when the compiler can detect that, so to be safe, we wrap functions in a container with just 1 method.
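As a minimal sketch of that behavior (all names here are made up for illustration): with only two applicable methods that both return Float64, inference through an abstract argument can still conclude Float64.

```julia
abstract type Shape end
struct Circle <: Shape r::Float64 end
struct Square <: Shape s::Float64 end
area(c::Circle) = π * c.r^2
area(sq::Square) = sq.s^2
# The area call below must dispatch at runtime, but with few enough
# applicable methods the compiler may infer each one and union the results.
callarea(x::Shape) = area(x)
Base.return_types(callarea, Tuple{Shape}) # Any[Float64]
```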
# v1.11.5
# NB: this shadows Base.Returns, which is unrelated
struct Returns{ReturnType, CallableType}
    f::CallableType
end
# our typical constructor
Returns{R}(f) where R = Returns{R, typeof(f)}(f)
# use constructor for automatic conversions by field or element types
Base.convert(::Type{Returns{R}}, f) where {R} = Returns{R}(f)
#=
The idea is that a runtime dispatch only goes to this method, so the
compiler can infer a return type of R for any method specialization.
If a given specialization's inner call has an inferred return type of R,
then the type assertion can be resolved at compile-time.
Note that Returns only needed to wrap a callable to substitute for it, so
if the inner callable could be an argument f::F, then the same R inference
could still occur, but the demo would be weirder and more opaque.
=#
(r::Returns{R})(args::Vararg{Any, N}) where {R,N} = @inline r.f(args...)::R
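To see the assertion at work in isolation, here’s a self-contained sketch mirroring the definition above (renamed Returns2 so it doesn’t clash with it or with Base.Returns):

```julia
struct Returns2{R, F}
    f::F
end
Returns2{R}(f) where R = Returns2{R, typeof(f)}(f)
# the type assertion promises R to the caller's inference
(r::Returns2{R})(args::Vararg{Any, N}) where {R, N} = r.f(args...)::R

r = Returns2{Float64}(sqrt)
r(2.0)       # 1.4142135623730951, assertion holds
rbad = Returns2{Int}(sqrt)
# rbad(2.0)  # throws a TypeError: sqrt returned a Float64, not an Int
```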
For our demo, let’s index an array holding a single function, so that the compiler only sees the element type. We’ll also use the typical approach, FunctionWrappers.jl, for comparison.
# Vary the callable at runtime in this simpler example, but
# varying the argument types could also work
call1st(list, y) = @inbounds list[1](y) # unsafe but trims overhead
using FunctionWrappers: FunctionWrapper as FW # v1.1.3
add1(y) = y+1
stablelist = typeof(add1)[add1] # don't vary, for static dispatch
FWlist = FW{Float64, Tuple{Float64}}[add1] # magic concrete element type
returnslist = Returns{Float64}[add1]
anylist = Any[add1]
Now let’s see if telling the compiler that the callable must be of type Returns{Float64} improves type inference of that runtime-dispatched call:
julia> for list in (stablelist, FWlist, returnslist, anylist)
println(Base.return_types(call1st, Tuple{typeof(list), Float64}))
end
Any[Float64]
Any[Float64]
Any[Float64]
Any[Any]
Inference is as good as a static dispatch or using FunctionWrapper! But how’s the performance?
julia> using BenchmarkTools
julia> for list in (stablelist, FWlist, returnslist, anylist)
@btime call1st($list, $1.2)
end
1.800 ns (0 allocations: 0 bytes)
7.900 ns (0 allocations: 0 bytes)
24.598 ns (2 allocations: 32 bytes)
24.549 ns (2 allocations: 32 bytes)
julia> @btime add1($1.2) evals=100_000
1.798 ns (0 allocations: 0 bytes)
2.2
julia> @btime $(Returns{Float64}(add1))($1.2) evals=100_000
2.055 ns (0 allocations: 0 bytes)
2.2
julia> @btime $(FW{Float64, Tuple{Float64}}(add1))($1.2) evals=100_000
8.286 ns (0 allocations: 0 bytes)
2.2
The static dispatch is as good as calling the function directly, though it’s worth noting I removed the bounds-checking overhead. FunctionWrapper’s function pointer and possible conversions add significant overhead, but no heap allocations. However, our Returns{Float64} version performs as poorly and allocates as much as the Any case, despite the improved type inference!
I don’t know how to dig deeper into the compiled code to pinpoint heap allocations, and it really seems as if there doesn’t need to be one. add1(1.2) or Returns{Float64}(add1)(1.2) are so well inferred that they just do an fadd, no heap allocations needed. The runtime dispatch list[1](y) has a known return type, but it appears to box the instance just to unbox it. Why, and is it inherently necessary? Is jl_gc_pool_alloc_instrumented a heap allocation routine? Excerpts of the @code_llvm:
julia> @code_llvm call1st(returnslist, 1.2)
...
%"box::Float64" = call noalias nonnull align 8 dereferenceable(16) ptr @ijl_gc_pool_alloc_instrumented(ptr %ptls_load, i32 800, i32 16, i64 140703751232704) #8
%"box::Float64.tag_addr" = getelementptr inbounds i64, ptr %"box::Float64", i64 -1
store atomic i64 140703751232704, ptr %"box::Float64.tag_addr" unordered, align 8
store double %"y::Float64", ptr %"box::Float64", align 8
%gc_slot_addr_0 = getelementptr inbounds ptr, ptr %gcframe1, i64 2
store ptr %"box::Float64", ptr %gc_slot_addr_0, align 16
store ptr %"box::Float64", ptr %jlcallframe, align 8
%2 = call nonnull ptr @ijl_apply_generic(ptr nonnull %1, ptr nonnull %jlcallframe, i32 1)
%.unbox = load double, ptr %2, align 8
%frame.prev10 = load ptr, ptr %frame.prev, align 8
store ptr %frame.prev10, ptr %pgcstack, align 8
ret double %.unbox
The only difference in @code_llvm call1st(anylist, 1.2) is the absence of the unboxing line %.unbox = load double, ptr %2, align 8 and the corresponding return of the box, ret ptr %2. The name box::Float64 is present there too, so it seems to be derived from the input y’s type.
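For what it’s worth, the boxing can be reproduced standalone without the wrapper (a quick check with made-up names; exact byte counts may differ by version): the asserted Float64 return is known to inference, yet the runtime-dispatched call still allocates.

```julia
# fs has the abstract element type Function, so fs[1](x) dispatches at runtime
callfirst(fs, x) = @inbounds fs[1](x)::Float64
fs = Function[sqrt]
callfirst(fs, 2.0)             # warm up so compilation isn't measured
@allocated callfirst(fs, 2.0)  # nonzero despite the known return type
```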