How does Any actually work?

I do not work on the compiler in any capacity, but I can at least answer some of the questions you clarified and help you peek into the LLVM a bit. It’s still not very clear to me what you know or don’t know, so you’ll have to forgive any unexpected redundancies.

The simple answer is that it does not constrain Dict{Symbol, Vector} any further, because that is already a concrete type with instances. But that’s also a red herring: what you’re really concerned about is the abstract type parameter Vector and all the Any showing up in type inference.
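You can check the concreteness claim yourself (a quick REPL sketch using Base’s isconcretetype):

julia> isconcretetype(Dict{Symbol, Vector})
true

julia> isconcretetype(Vector)
false

Dict{Symbol, Vector} is concrete even though its value-type parameter is not; instances of it exist, but the concrete types of the vectors they hold stay unknown.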

Type inference is what the compiler can figure out in advance of any calls on any instances. Let’s take a simpler example, Some, the simplest non-empty parametric struct possible. Its definition is just struct Some{T} value::T end, and the API for accessing its value field is the function something. Some{Any} is a particular concrete type whose field can hold any instance: 2, 3im, "four", Vector[[5], []], you name it. Given that type, it’s clear that something can also return any instance, so the call signature something(::Some{Any}) can only have an inferred return type of Any. In more steps, that’s also why you see so many Anys: if you index a Dict{Symbol, Vector}, you get one of an infinite number of concrete subtypes of the abstract Vector, and indexing any of those subtypes can result in any instance.
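You can watch inference reach that conclusion with Base.return_types, a standard reflection utility (output from a recent Julia version; formatting may differ on yours):

julia> Base.return_types(something, (Some{Int},))
1-element Vector{Any}:
 Int64

julia> Base.return_types(something, (Some{Any},))
1-element Vector{Any}:
 Any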

Okay, so sometimes the compiler can’t narrow things down to concrete types; that’s understandable. But how does that make things slow? People have been trying to answer this, but maybe it’ll help if we look right at the code. First we look at something fast:

julia> plus1(x::Some) = something(x)+1
plus1 (generic function with 1 method)

julia> @code_llvm plus1(Some{Int}(2))
; Function Signature: plus1(Base.Some{Int64})
;  @ REPL[1]:1 within `plus1`
; Function Attrs: uwtable
define i64 @julia_plus1_963(ptr nocapture noundef nonnull readonly align 8 dereferenceable(8) %"x::Some") #0 {
top:
; ┌ @ int.jl:87 within `+`
   %"x::Some.value_ptr.unbox" = load i64, ptr %"x::Some", align 8
   %0 = add i64 %"x::Some.value_ptr.unbox", 1
   ret i64 %0
; └
}

Despite your (and my own) inexperience with LLVM, you can probably see that there are only 2 instructions before returning: load an i64 from the Some{Int} and add 1 to it. The compiler figured out all the types and methods from the concrete inputs, and we were rewarded with efficient code for a +(::Int64, ::Int64) call. Now let’s use Some{Any} for slow code:

julia> @code_llvm plus1(Some{Any}(2))
; Function Signature: plus1(Base.Some{Any})
;  @ REPL[3]:1 within `plus1`
; Function Attrs: uwtable
define nonnull ptr @julia_plus1_928(ptr nocapture noundef nonnull readonly align 8 dereferenceable(8) %"x::Some") #0 {
top:
  %jlcallframe1 = alloca [2 x ptr], align 8
; ┌ @ some.jl:103 within `something`
; │┌ @ Base.jl:49 within `getproperty`
    %"x::Some.value" = load atomic ptr, ptr %"x::Some" unordered, align 8
; └└
  store ptr %"x::Some.value", ptr %jlcallframe1, align 8
  %0 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
  store ptr @"jl_global#933.jit", ptr %0, align 8
  %1 = call nonnull ptr @ijl_apply_generic(ptr nonnull @"jl_global#932.jit", ptr nonnull %jlcallframe1, i32 2)
  ret ptr %1
}

Oof, that is much more of an eyesore, and I chose Some specifically to keep it as simple as possible. Let’s break it down:

  • stack-allocates a “callframe” holding a couple of pointers (ptr)
  • loads the data from the x::Some{Any} object’s value field for something(x). We’re dealing with a pointer at that field because instances vary in size, so the actual data is stored elsewhere on the heap.
  • stores the loaded data into the call frame
  • stores "jl_global#933.jit" into the call frame. It’s normal not to know what that even is; it’s an internal detail.
  • calls @ijl_apply_generic on that call frame and returns the result

You’ll notice that the compiled code clearly computed the something(x) part, but the +1 part (the entire point of the function) is entirely missing. Instead, we prepare a call frame and call @ijl_apply_generic. Previous commenters were entirely correct: the compiler is lazy and does not do the +1 part here. Of course it can’t; as you’ve pointed out, there are an infinite number of possible concrete types to add 1 to, and we can’t eagerly compile all those operations. Instead, the compiled code defers that work to another method dispatch at runtime. I’ll quote the docs’ own example:

Given the call f(x, y), the following steps are performed: first, the method table to use is accessed as typeof(f).name.mt. Second, an argument tuple type is formed, Tuple{typeof(f), typeof(x), typeof(y)}. Note that the type of the function itself is the first element. This is because the type might have parameters, and so needs to take part in dispatch. This tuple type is looked up in the method table.
This dispatch process is performed by jl_apply_generic, which takes two arguments: a pointer to an array of the values f, x, and y, and the number of values (in this case 3).
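You can poke at that lookup from the REPL with which, another standard reflection function (printed form varies by version):

julia> which(+, Tuple{Int64, Int64})
+(x::T, y::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} @ Base int.jl:87

Note the int.jl:87 location, the same method the fast @code_llvm output above inlined.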

In our own plus1(Some{Any}(2)) example, the runtime dispatch finds the method matching Tuple{typeof(+), Int64, Int64}, which you’ll notice we’ve already dispatched to and compiled in the earlier plus1(Some{Int}(2)). In other calls, like plus1(Some{Any}(3im)), the same compiled code for plus1(::Some{Any}) has to find other + methods, potentially having to compile them as well. Method dispatch takes time*, which is why we try fairly hard to maintain type stability in hot code to let the compiler do that work at compile time instead. Another reason for slowdowns was briefly mentioned earlier: an instance of an unknown type has an unknown size, so the compiler can’t plan inline or stack allocation for it, and accessing the heap through a pointer takes extra work for a whole host of reasons.
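If the LLVM output is hard to read, Julia’s own typed IR tells the same story one level up (a sketch via @code_typed; exact printing varies across versions):

julia> @code_typed plus1(Some{Any}(2))
CodeInfo(
1 ─ %1 = Base.getfield(x, :value)::Any
│   %2 = (%1 + 1)::Any
└──      return %2
) => Any

The ::Any on the + line marks the runtime dispatch; run @code_typed plus1(Some{Int}(2)) and the same line becomes a concrete Base.add_int(%1, 1)::Int64.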

julia> using BenchmarkTools

julia> @btime plus1($(Some{Int}(2)))
  2.000 ns (0 allocations: 0 bytes)
3

julia> @btime plus1($(Some{Any}(2)))
  16.633 ns (0 allocations: 0 bytes)
3

*14–15 ns is not bad once per μs, but it’s pretty awful before each 2 ns integer addition.
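As an aside, when you’re stuck with an unknown element type in hot code, one common mitigation is a type assertion at the unstable boundary; here’s a hypothetical variant of plus1 (the name and the Int assumption are mine):

julia> plus1_asserted(x::Some) = (something(x)::Int) + 1
plus1_asserted (generic function with 1 method)

julia> Base.return_types(plus1_asserted, (Some{Any},))
1-element Vector{Any}:
 Int64

The assertion compiles to a cheap runtime type check (throwing a TypeError on mismatch), and everything after it is inferred concretely again; function barriers accomplish the same by passing the value to a function that gets specialized on its runtime type.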

A comparable lookup process happens in CPython for every call. While CPython implements single dispatch at runtime, Julia has multiple dispatch that may occur at compile time or run time depending on how well type inference goes. It’s worth mentioning that there are implementations of Python (or subsets and supersets of it) where a compiler is around to do method dispatch at compile time, though there are often practical constraints from compliance with the Python language standard.
