Why does runtime dispatch allocate when the return type is inferred?

The usual explanation for why runtime dispatches allocate is that falling back to an inferred return type of ::Any requires boxing the instance at runtime. Thing is, there seems to be a workaround. If the compiler knows that a call can only dispatch to a few methods (up to a small limit, 3 by default), it may infer all of them to potentially improve the return type inference. I don’t know exactly how and when the compiler can detect that, so to be safe, we wrap functions in a container with just 1 method.
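As a hedged sketch of that behavior (the types and functions here are invented for illustration): when only a few methods can match an abstractly typed call, the compiler may infer each of them and still derive a concrete return type for the caller.

```julia
# Invented example: two concrete subtypes, two matching methods.
abstract type Shape end
struct Circle <: Shape
    r::Float64
end
struct Square <: Shape
    s::Float64
end

area(c::Circle) = pi * c.r^2
area(s::Square) = s.s^2

# The caller only knows `x isa Shape`, yet both candidate methods return
# Float64, so the runtime-dispatched call can still infer concretely:
callarea(x::Shape) = area(x)
Base.return_types(callarea, Tuple{Shape})  # Any[Float64]
```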

# v1.11.5
struct Returns{ReturnType, CallableType}
    f::CallableType
end

# our typical constructor
Returns{R}(f) where R = Returns{R, typeof(f)}(f)

# use constructor for automatic conversions by field or element types
Base.convert(::Type{Returns{R}}, f) where {R} = Returns{R}(f)

#=
The idea is that a runtime dispatch only goes to this method, so the
compiler can infer a return type of R for any method specialization.
If a given specialization's inner call has an inferred return type of R,
then the type assertion can be resolved at compile-time.
Note that Returns only needed to wrap a callable to substitute for it, so
if the inner callable could be an argument f::F, then the same R inference
could still occur, but the demo would be weirder and more opaque.
=#
(r::Returns{R})(args::Vararg{Any, N}) where {R,N} = @inline r.f(args...)::R
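As a quick sanity check of the wrapper (the definitions are restated here so the snippet runs standalone; the usage is my own illustration): the R parameter is both a hint to inference and an enforced assertion.

```julia
# Same wrapper as above, restated so this snippet is self-contained
struct Returns{ReturnType, CallableType}
    f::CallableType
end
Returns{R}(f) where R = Returns{R, typeof(f)}(f)
(r::Returns{R})(args::Vararg{Any, N}) where {R,N} = @inline r.f(args...)::R

r = Returns{Float64}(abs)
r(-2.0)                    # 2.0, with the call inferred as Float64
# Returns{Int}(abs)(-2.0)  # would throw a TypeError at the ::R assertion
```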

For our demo, let’s index an array of 1 function so that the compiler only sees the element type. We’ll also use the typical approach, FunctionWrappers.jl, for comparison.

# Vary the callable at runtime in this simpler example, but
# varying the argument types could also work
call1st(list, y) = @inbounds list[1](y) # unsafe but trims overhead

using FunctionWrappers: FunctionWrapper as FW # v1.1.3

add1(y) = y+1
stablelist = typeof(add1)[add1] # don't vary, for static dispatch
FWlist = FW{Float64, Tuple{Float64}}[add1] # magic concrete element type
returnslist = Returns{Float64}[add1]
anylist = Any[add1]

Now let’s see if telling the compiler that the callable must be of type Returns{Float64} improves type inference of that runtime-dispatched call:

julia> for list in (stablelist, FWlist, returnslist, anylist)
         println(Base.return_types(call1st, Tuple{typeof(list), Float64}))
       end
Any[Float64]
Any[Float64]
Any[Float64]
Any[Any]

Inference is as good as a static dispatch or using FunctionWrapper! But how’s the performance?

julia> using BenchmarkTools

julia> for list in (stablelist, FWlist, returnslist, anylist)
         @btime call1st($list, $1.2)
       end
  1.800 ns (0 allocations: 0 bytes)
  7.900 ns (0 allocations: 0 bytes)
  24.598 ns (2 allocations: 32 bytes)
  24.549 ns (2 allocations: 32 bytes)

julia> @btime add1($1.2) evals=100_000
  1.798 ns (0 allocations: 0 bytes)
2.2

julia> @btime $(Returns{Float64}(add1))($1.2) evals=100_000
  2.055 ns (0 allocations: 0 bytes)
2.2

julia> @btime $(FW{Float64, Tuple{Float64}}(add1))($1.2) evals=100_000
  8.286 ns (0 allocations: 0 bytes)
2.2

The static dispatch is as good as calling the function directly, though it’s worth noting I removed bounds-checking overhead. FunctionWrapper’s function pointer and possible conversions add significant overhead, but no heap allocations. However, our Returns{Float64} version performs as poorly and allocates as much as the Any case, despite the improved type inference!

I don’t know how to dig deeper into compiled code to pinpoint heap allocations, and it really seems as if there doesn’t need to be one. add1(1.2) and Returns{Float64}(add1)(1.2) are so well inferred they compile down to a single fadd, no heap allocations needed. The runtime dispatch list[1](y) has a known return type, but it appears to box the instances just to unbox them. Why, and is it inherently necessary? Is jl_gc_pool_alloc_instrumented a heap-allocation routine? Excerpts of the @code_llvm:

julia> @code_llvm call1st(returnslist, 1.2)
...
  %"box::Float64" = call noalias nonnull align 8 dereferenceable(16) ptr @ijl_gc_pool_alloc_instrumented(ptr %ptls_load, i32 800, i32 16, i64 140703751232704) #8
  %"box::Float64.tag_addr" = getelementptr inbounds i64, ptr %"box::Float64", i64 -1
  store atomic i64 140703751232704, ptr %"box::Float64.tag_addr" unordered, align 8
  store double %"y::Float64", ptr %"box::Float64", align 8
  %gc_slot_addr_0 = getelementptr inbounds ptr, ptr %gcframe1, i64 2
  store ptr %"box::Float64", ptr %gc_slot_addr_0, align 16
  store ptr %"box::Float64", ptr %jlcallframe, align 8
  %2 = call nonnull ptr @ijl_apply_generic(ptr nonnull %1, ptr nonnull %jlcallframe, i32 1)
  %.unbox = load double, ptr %2, align 8
  %frame.prev10 = load ptr, ptr %frame.prev, align 8
  store ptr %frame.prev10, ptr %pgcstack, align 8
  ret double %.unbox

The only difference in @code_llvm call1st(anylist, 1.2) is the absence of the unboxing line %.unbox = load double, ptr %2, align 8 and, in its place, the return of the box itself: ret ptr %2. The name box::Float64 is present there too, so that name seems to come from the input y’s type.


Yeah, jl_gc_pool_alloc_instrumented allocates using the GC.


The calling convention for dynamic dispatch requires boxing of many values that would otherwise be stack-allocated. This includes Float64s, but excludes most mutable types and small integers (Julia has a cache of boxed small Ints).
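A small illustrative check of that distinction (the container and names here are my own, not from the thread): hiding the callable behind an `Any`-typed `Ref` forces dynamic dispatch, and `@allocated` hints at whether the argument had to be freshly boxed.

```julia
# Force dynamic dispatch by hiding the callable behind an Any field
fref = Ref{Any}(identity)
calldyn(r, x) = r[](x)

calldyn(fref, 1); calldyn(fref, 1.5)  # warm up so we don't measure compilation
@allocated(calldyn(fref, 1))    # small Ints are served from a cache of boxes
@allocated(calldyn(fref, 1.5))  # the Float64 argument must be freshly boxed
```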

There was some discussion around the problem and an experimental PR a couple years back, see `hash(::Vector{<abstract type>}, ::UInt)` allocates quite a bit due to type instability: why can't hash(::Any) infer `UInt` return type? · Issue #44244 · JuliaLang/julia · GitHub and [WIP] dynamic dispatch optimization: avoid boxing stack-allocated inputs by NHDaly · Pull Request #50136 · JuliaLang/julia · GitHub. I suspect both will contain relevant info for the thread you forked this new topic off of.


The devdocs don’t explain many things and I’m not even sure what a box means anymore, but supposing that boxes are heap-allocated, GC-referenced values of any type, it seems like the arguments and return value being strictly “pointers to boxes” would explain the 2 allocations:

  1. the stack-allocated input (1.2) gets stored in a box in preparation for the runtime-dispatched call. The box is allocated in the line %"box::Float64" = call noalias nonnull align 8 dereferenceable(16) ptr @ijl_gc_pool_alloc_instrumented(ptr %ptls_load, i32 800, i32 16, i64 140703751232704) #8
  2. the return value of the runtime dispatch is a pointer to a box it allocated for the method specialization’s return value. The runtime dispatch occurs in the line %2 = call nonnull ptr @ijl_apply_generic(ptr nonnull %1, ptr nonnull %jlcallframe, i32 1)

Interestingly, the box allocation in (1) seems to be elided if the input was already “boxed” to begin with, like an element of Array{Number} (and not Array{Float64}):

julia> call1st1st(list, list2) = @inbounds list[1](list2[1])
call1st1st (generic function with 1 method)

julia> @btime call1st($returnslist, $1.2) # in the original example
  27.884 ns (2 allocations: 32 bytes)
2.2

julia> @btime call1st1st($returnslist, $([1.2])) # element not boxed
  28.643 ns (2 allocations: 32 bytes)
2.2

julia> @btime call1st1st($returnslist, $(Number[1.2])) # element boxed
  24.297 ns (1 allocation: 16 bytes)
2.2

julia> @code_llvm call1st1st(returnslist, Number[1.2])
...
   %2 = load ptr, ptr %"list2::Array", align 8
   %3 = load atomic ptr, ptr %2 unordered, align 8
...
  store ptr %3, ptr %jlcallframe, align 8
  %4 = call nonnull ptr @ijl_apply_generic(ptr nonnull %1, ptr nonnull %jlcallframe, i32 1)
  %.unbox = load double, ptr %4, align 8
...
  ret double %.unbox
}

What is it about boxes or jl_apply_generic that requires this? Can the information in the boxes really not be put on the caller’s stack for the generic call to read from and write to?

I believe so. The missing info you want is likely Memory layout of Julia Objects · The Julia Language, which explains the layout of the jl_value_t*s that jl_value_t *jl_apply_generic(jl_value_t *F, jl_value_t **args, uint32_t nargs) expects. It’s unfortunate that the unrelated Core.Box also exists to cause confusion, but thankfully that doesn’t show up in any of these examples.

Yes, this is what I was alluding to with my first sentence about “requires boxing…” earlier. Array{Number} is, roughly speaking, an array of already-boxed jl_value_t*s.

The signature of jl_apply_generic is rather restrictive and makes a number of assumptions around the lifetimes of the jl_value_t **args it receives. Part of the linked PR discussion is whether it’d be safe to “fake” jl_value_t*s that point to the stack instead of GC-allocated and tracked memory; the answer appears to be no, not without non-trivial changes on the GC side.

Regardless of what the best solution is, it’d likely require coordinated changes to codegen and the runtime, as noted in [WIP] dynamic dispatch optimization: avoid boxing stack-allocated inputs by NHDaly · Pull Request #50136 · JuliaLang/julia · GitHub. If I had to guess, this level of complexity is a big part of why it hasn’t been done yet.


The jl_value_t struct is the name for a block of memory owned by the Julia Garbage Collector, representing the data associated with a Julia object in memory. Absent any type information, it is simply an opaque pointer:

typedef struct jl_value_t* jl_pvalue_t;

Each jl_value_t struct is contained in a jl_typetag_t struct that contains metadata information about the Julia object, such as its type and garbage collector (gc) reachability.

That seems to explain why the expected metadata doesn’t show up in memory allocation measurements of, say, a Vector{Number}, though it’s still not clear to me whether heterogeneous collections point to the jl_typetag_t or just the jl_value_t for the necessary runtime type information. The text seems to imply the latter, but if so, what uses the metadata?
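One unsafe way to poke at this from the Julia side (illustrative only, relying on layout details of a typical 64-bit build that could change): the word immediately before a box’s data holds the type pointer with GC bits in its low bits, matching the tag_addr = getelementptr ... i64 -1 store in the IR above.

```julia
x = Ref{Any}(1.5)  # the Any field forces 1.5 into a heap box
GC.@preserve x begin
    # The field holds a pointer to the box's data:
    box = unsafe_load(Ptr{Ptr{UInt}}(pointer_from_objref(x)))
    # The tag word sits one pointer-width before that data:
    tag = unsafe_load(box - sizeof(UInt))
    # Masking off the low GC bits should recover the DataType pointer:
    tag & ~UInt(15) == UInt(pointer_from_objref(Float64))
end
```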

It does at least make it more apparent that boxes are implemented to be GC-tracked; a box is not just a pointer to anywhere and isn’t a language-level thing that could undergo heap-to-stack optimizations. It’s at least acknowledged to not be strictly necessary for runtime dispatch, which FunctionWrappers also demonstrates. I went through the trouble of making this topic because FunctionWrapper’s function pointer magic doesn’t respond to method invalidations well even if I try the internal reinit_wrapper, so Returns{R}(f) would be neater if runtime dispatch could load and write unboxed inputs and outputs.
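To illustrate the invalidation point with the wrapper above (restated so the snippet is self-contained; the behavior assumed here is ordinary top-level world-age semantics): Returns{R} keeps going through normal dispatch, so it picks up redefinitions.

```julia
# Same wrapper as earlier in the thread, restated to run standalone
struct Returns{ReturnType, CallableType}
    f::CallableType
end
Returns{R}(f) where R = Returns{R, typeof(f)}(f)
(r::Returns{R})(args::Vararg{Any, N}) where {R,N} = @inline r.f(args...)::R

addx(y) = y + 1.0
r = Returns{Float64}(addx)
r(1.0)              # 2.0
addx(y) = y + 2.0   # redefinition invalidates the old specialization
r(1.0)              # 3.0: the wrapper dispatches to the new method
```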