Performance discrepancy among different ways of piping/composing functions for handling runtime dispatch

MWE (tested on Julia 1.11.5):

julia> foo1(::Val{N}) where {N} = N
foo1 (generic function with 1 method)

julia> foo2(n::Int) = Val(n)
foo2 (generic function with 1 method)

julia> a = Val.(rand(Int, 1000));

julia> using BenchmarkTools

julia> @benchmark for i in $a
           i |> foo1 |> foo2
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  267.700 μs …  2.704 ms  ┊ GC (min … max): 0.00% … 88.68%
 Time  (median):     280.400 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   284.462 μs ± 51.782 μs  ┊ GC (mean ± σ):  0.67% ±  3.25%

             ▄██▅▁▂▁▁
  ▁▁▁▁▁▁▁▂▄▆▇█████████▄▄▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  268 μs          Histogram: frequency by time          320 μs <

 Memory estimate: 46.88 KiB, allocs estimate: 2000.

julia> @benchmark for i in $a
           foo1(i) |> foo2
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  262.300 μs …  3.058 ms  ┊ GC (min … max): 0.00% … 89.87%
 Time  (median):     274.600 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   278.443 μs ± 62.050 μs  ┊ GC (mean ± σ):  0.69% ±  2.79%

                 ▂▃█▆▆▃
  ▂▁▁▁▁▂▂▃▃▃▃▄▅▆████████▇▅▄▄▄▆▆▆▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  262 μs          Histogram: frequency by time          301 μs <

 Memory estimate: 46.88 KiB, allocs estimate: 2000.

julia> @benchmark for i in $a
           foo2(i |> foo1)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  152.900 μs …  2.155 ms  ┊ GC (min … max): 0.00% … 90.02%
 Time  (median):     157.700 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   160.376 μs ± 44.962 μs  ┊ GC (mean ± σ):  0.65% ±  2.20%

          ▁▂▆█▄▇▂▂
  ▂▂▂▃▃▄▄▅████████▇▇▅▄▄▃▃▃▃▃▂▃▃▃▃▃▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▃
  153 μs          Histogram: frequency by time          174 μs <

 Memory estimate: 31.25 KiB, allocs estimate: 1000.

julia> @benchmark for i in $a
           foo2(foo1(i))
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  150.100 μs …  2.691 ms  ┊ GC (min … max): 0.00% … 92.80%
 Time  (median):     157.000 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   159.191 μs ± 43.401 μs  ┊ GC (mean ± σ):  0.57% ±  2.02%

            ▇█▄▃▁
  ▁▁▁▁▂▂▂▃▅██████▅▅▃▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  150 μs          Histogram: frequency by time          184 μs <

 Memory estimate: 31.25 KiB, allocs estimate: 1000.

julia> @benchmark for i in $a
           i |> (foo2∘foo1)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  56.100 μs … 403.200 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     57.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   58.965 μs ±   6.902 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂█▄▅
  ▁▂▃████▇▄▄▃▂▂▂▂▃▃▂▂▂▁▂▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  56.1 μs         Histogram: frequency by time           72 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

It’s not surprising that (foo2∘foo1) gave the best performance: when foo1 and foo2 are compiled together, the compiler can eliminate the intermediate conversion. However, it’s still not as fast as

julia> @benchmark for i in $a
           i |> identity
       end
BenchmarkTools.Trial: 10000 samples with 626 evaluations per sample.
 Range (min … max):  187.859 ns … 389.457 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     188.498 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   191.519 ns ±   8.910 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▇▂▂▂▄▃▃▁  ▁▂▁        ▄▆▅        ▁▂▂▁                 ▁      ▂
  ███████████▇███▇▇▇▇▅▅▅▄█████▆▇▆▇▇▇████▆▆▆▁▄▄▁▁▄▅▆▅▅▇▆▇██▅▁▁▇▇ █
  188 ns        Histogram: log(frequency) by time        205 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I assume the reason is that currently we don’t have the ability to annotate non-Type parameters in a parametric type:

julia> typeof(a)
Vector{Val} # not something like `Vector{Val{::Int}}`

Still, I wonder whether a performance discrepancy as large as the one presented here among these implementation details is expected. And is it possible for the gaps to be eliminated in the future? Thanks!!

What’s the motivation here, though? Where would any of these examples be relevant?

  • Often it’s actually preferable to use Vector{Any} rather than Vector{T} for some T that’s neither bottom, concrete, nor Any. So I’m not sure this is a good motivation for Val{::Int}.

  • If you have a collection of Int, why are you wrapping each in a Val?

  • Assuming you really do want to put a bunch of Int into the type domain, why would you use Vector to store the heterogeneous elements?

Vector{Val} was just used to provide an extreme test set for benchmarking different piping implementations.

In my actual scenarios, I do not use Vector to directly store them. What I have is a collection of composite type instances ::MyType{N, O, ...} with non-Type type parameters N that I need to extract at runtime and re-encapsulate into the signature of some other parametric type. On the one hand, the possible values of N are far fewer than 1000, but they are inhomogeneous (having different values). On the other hand, the number of MyType instances is very large, so I cannot just use a Tuple to store them. I would also prefer the collection container to be of type AbstractArray{<:MyType} rather than simply AbstractArray{Any}, because this lets me recover a completely type-stable result when I do have a homogeneous collection of MyType as the input argument.
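(For concreteness, here is a hypothetical sketch of that scenario; MyType, OtherType, and retag are made-up names, not my real types:)

# Hypothetical sketch: extract the non-Type parameter N at runtime and
# re-encapsulate it into the signature of another parametric type.
struct MyType{N}
    payload::NTuple{N, Float64}
end

struct OtherType{N}
    data::NTuple{N, Int}
end

# N is recovered from MyType's signature and rewrapped into OtherType's.
retag(::MyType{N}) where {N} = OtherType{N}(ntuple(zero, Val(N)))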

I’m not focusing on why the piping implementations of foo1 and foo2 in the OP do not perform as fast as identity, but rather on why they do not perform equally among themselves. It’s a bit surprising to me that such a “simple” case causes performance divergence between functionally equivalent syntactic sugars for function piping as soon as type-parameter manipulation is involved.

A bit more info on why I have N as a type parameter in the first place instead of just storing it in a field of MyType: it dictates the length of a Tuple that will be constructed for downstream calculation. I can eliminate many allocations if I know N at compile time. However, it also seems that if I don’t, simply propagating it through the type signature is already very inefficient.
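(A minimal sketch of why a compile-time N helps; build and build_dyn are hypothetical names:)

# With N in the type domain, the tuple length is static: the call is type
# stable and requires no heap allocation.
build(::Val{N}) where {N} = ntuple(i -> i, Val(N))

# With a runtime n, the return type NTuple{n, Int} depends on a value, so
# inference only sees Tuple{Vararg{Int}} and the result must be handled
# dynamically.
build_dyn(n::Int) = ntuple(i -> i, n)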

Personally, I wish the dispatch system could treat parametric types as efficiently as if-else branches on plain data when it fails to infer types ahead of time.

Have you seen LightSumTypes.jl? If you can limit yourself to some largest N, then this package does exactly what you described (replacing dynamic dispatch with if-else).
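(For illustration, the manual form of that pattern looks like this; do_work stands in for whatever type-stable function consumes the Val, and a package like this generates the equivalent branching for you:)

# Manual "if-else instead of dynamic dispatch" sketch, assuming N is known
# to be limited to 1:3. Each branch passes a concrete Val type, so every
# do_work call below can be statically dispatched and inlined.
function consume(n::Int)
    if n == 1
        do_work(Val(1))
    elseif n == 2
        do_work(Val(2))
    elseif n == 3
        do_work(Val(3))
    else
        throw(ArgumentError("unsupported N = $n"))
    end
end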


Btw, to explain this performance: you need to count the number of dynamic dispatches. Since your container is abstractly typed, Julia doesn’t know what kind of Val each iteration gets. Additionally, foo2 is type unstable. I am not quite sure how these facts interact, though, nor how |> is implemented. I am on mobile currently and cannot check.
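(That second point can be checked directly; the inferred return type should be the abstract Val, since the concrete Val{n} depends on the runtime value of n:)

# foo2's concrete output type Val{n} depends on the *value* of n, so
# inference can only conclude the abstract type Val:
Base.return_types(foo2, (Int,))  # expected to return Any[Val]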

ComposedFunction’s call aggressively @inlines its contained functions’ calls, so with type stability, all of it is eliminated.

julia> @code_llvm identity(Val(1))
; Function Signature: identity(Base.Val{1})
;  @ operators.jl:531 within `identity`
; Function Attrs: uwtable
define void @julia_identity_4722() #0 {
top:
  ret void
}

julia> @code_llvm (foo2∘foo1)(Val(1))
; Function Signature: (::Base.ComposedFunction{typeof(Main.foo2), typeof(Main.foo1)})(Base.Val{1})
;  @ operators.jl:1050 within `ComposedFunction`
; Function Attrs: uwtable
define void @julia_ComposedFunction_4719() #0 {
top:
  ret void
}

But you don’t have type stability with a Val-inferred input for the call, so compiler optimizations get iffier. For identity, the compiler is able to recognize the sole method and inline the code, so there isn’t even anything really happening in the loop. The loop probably only exists because there’s a possibility of an undefined reference, and it’s only semantically correct to hit that error. A foo1 call alone gets ignored, so it performs the same.

When the foo2 call gets involved, the calls don’t get entirely ignored anymore and we get varying degrees of runtime dispatch (@ijl_apply_generic) and associated allocations (which can be 0 if inputs and outputs are already boxed). foo1’s runtime dispatch is apparent, but even foo2 may also need runtime dispatch if it’s not ignored and N is inferred as Any.

It’s possible that with such an additional way to construct parametric types (Val{N} where N isa Int) and a likely manual specification of such an element type, type inference improves and aggressive inlining can do away with the runtime dispatches. However, that’s a fairly brittle situation (the real versions of foo1 and foo2 may not be small enough even with @inline), and arguably the 1000 N::Int instances here shouldn’t be stored in the type domain. Types tell code how to handle their instances, so making thousands of types to do the same thing really suggests 1 type with thousands of instances, and the other way around will have significant optimization difficulties.

More specifically, Val instances are intended for limited static information and can take up no space (sizeof(Val(1)) == 0), but the abstract element type forces a to store a pointer for each element, which saves no space (sizeof(Ptr) == sizeof(Int)) at the cost of type stability and data locality. Sometimes inhomogeneous type collections are unavoidable, but it may also be possible to convert your types to 1 shared type in an isolated type-unstable step to optimize a processing step that handles them all the same way.
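(A minimal sketch of that last suggestion, assuming downstream work only needs the integers themselves; unval and process_all are hypothetical names:)

# Isolated type-unstable step: strip the Ints out of the type domain once,
# producing a concretely typed Vector{Int} for all downstream work.
unval(::Val{N}) where {N} = N

function process_all(a::AbstractVector)
    ns = Int[unval(v) for v in a]  # the only dynamically dispatched step
    sum(ns)                        # downstream code is fully type stable
end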


Thanks for the recommendation! It looks like what I need! I’ll test it out!

Thanks for the explanation!!

I think this is a good mental model, and I think in a sense I was following it by defining the method foo1(::Val{N}) where {N}, since this means that I only want a single method defined for all instances of one type, Val; it’s just that Val is not a concrete type but an abstract one. Although one could argue that “same type” only applies to concrete types, as that’s how instances are directly identified by the dispatch system (except for Type and Function, of course).

However, in cases where the “different type” instances are differentiated by their type parameters, as long as those parameters carry enough type information to be useful when they are treated as input values to a function, we can still have type-stable code:

julia> foo3(::Val{<:Integer}) = 1
foo3 (generic function with 1 method)

julia> foo3(::Val{<:AbstractFloat}) = 2
foo3 (generic function with 2 methods)

julia> typePool = (Float16, Float32, Float64, Int8, Int16, Int32, Int64)
(Float16, Float32, Float64, Int8, Int16, Int32, Int64)

julia> b = Val.(getindex.(Ref(typePool), rand(1:7, 1000)));

julia> @code_warntype foo3.(b)
MethodInstance for (::var"##dotfunction#231#2")(::Vector{Val})
  from (::var"##dotfunction#231#2")(x1) @ Main none:0
Arguments
  #self#::Core.Const(var"##dotfunction#231#2"())
  x1::Vector{Val}
Body::Vector{Int64}
1 ─ %1 = Base.broadcasted(Main.foo3, x1)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(foo3), Tuple{Vector{Val}}}
│   %2 = Base.materialize(%1)::Vector{Int64}
└──      return %2

By contrast, for non-Type parameters like N that do not (and currently cannot) carry any type information about themselves, whenever a function directly manipulates those parameters, the code becomes type unstable:

julia> @code_warntype foo1.(a)
MethodInstance for (::var"##dotfunction#240#1")(::Vector{Val})
  from (::var"##dotfunction#240#1")(x1) @ Main none:0
Arguments
  #self#::Core.Const(var"##dotfunction#240#1"())
  x1::Vector{Val}
Body::AbstractVector
1 ─ %1 = Base.broadcasted(Main.foo1, x1)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(foo1), Tuple{Vector{Val}}}
│   %2 = Base.materialize(%1)::AbstractVector
└──      return %2

This is why I think the lack of complete type-annotation capability for the signatures of parametric types can also lead to avoidable performance overhead.

That indeed applies to my statement, and I should’ve clarified. Abstract types do have instances indirectly, but it’s the concrete type that describes the structure for the compiled code to work with.

Indeed, and that’s a great demo of it. The compiler doesn’t just give up the moment types stop being concrete; in the case of foo3, it was able to recognize that there were only 2 methods it needed to consider, and both of them return Int64. The runtime dispatch still occurs, but the result goes into a much more efficiently typed output container. To further demonstrate that type inference is involved, we can define more methods with different return types that don’t involve Val, and they don’t affect the type inference:

julia> Base.return_types(Broadcast.BroadcastFunction(foo3), (typeof(b),)) # right after your code
1-element Vector{Any}:
 Vector{Int64} (alias for Array{Int64, 1})

julia> foo3(::String) = typemax(Float32) # nothing to do with Val
foo3 (generic function with 3 methods)

julia> Base.return_types(Broadcast.BroadcastFunction(foo3), (typeof(b),))
1-element Vector{Any}:
 Vector{Int64} (alias for Array{Int64, 1})

julia> foo3(::Symbol) = 4im # nothing to do with Val
foo3 (generic function with 4 methods)

julia> Base.return_types(Broadcast.BroadcastFunction(foo3), (typeof(b),))
1-element Vector{Any}:
 Vector{Int64} (alias for Array{Int64, 1})

But the moment we add Val-input methods with different return types, type inference changes in unpredictable ways. Your b never contains a Val{Symbol}() or a Val{String}(), but since its type is Vector{Val}, the compiler doesn’t actually know anything about the various Val type parameters, only Val.

julia> foo3(::Val{Symbol}) = Int32(5) # not in b, but its type doesn't know
foo3 (generic function with 5 methods)

julia> Base.return_types(Broadcast.BroadcastFunction(foo3), (typeof(b),))
1-element Vector{Any}:
 Union{Vector{Int32}, Vector{Int64}, Vector{Signed}}

julia> foo3(::Val{String}) = Int32(6)
foo3 (generic function with 6 methods)

julia> Base.return_types(Broadcast.BroadcastFunction(foo3), (typeof(b),))
1-element Vector{Any}:
 AbstractVector{<:Signed} (alias for AbstractArray{<:Signed, 1})

It’s worth noting that this exceeded the default limit (3 in v1.11.5) on the number of methods considered when inferring calls with abstract input types (as documented with Base.Experimental.@max_methods). In other cases, I’ve observed the compiler falling back to Any inferences, but somehow this example keeps going (is it the small method bodies, the broadcasting?). But usually you can only count on this optimization for (often internal) functions with only a couple of methods at most by design, like (::ComposedFunction).
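(If I’m reading the @max_methods docs correctly, the limit can also be raised for a specific function when it is first declared; this is a sketch of the documented form, not something benchmarked here:)

# Assumed form per the Base.Experimental.@max_methods docstring: raise the
# inference-time method limit for foo3 at the point of its declaration,
# before any of its methods are defined.
Base.Experimental.@max_methods 4 function foo3 end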

Back to the demo: we can recover the good type inference if we give the vector an element type that constrains the Val parameter enough to narrow the method table down to the Int-returning methods again:

julia> b2 = (Val{T} where T<:Union{typePool...})[e for e in b];

julia> Base.return_types(Broadcast.BroadcastFunction(foo3), (typeof(b2),))
1-element Vector{Any}:
 Vector{Int64} (alias for Array{Int64, 1})