edit: There is now ManualDispatch.jl, a simple manual implementation of the same idea, by @jlapeyre
This started as a simple benchmark to better understand the performance characteristics of dynamic dispatch, but I ended up trying out an old idea (mine? edit: of course not): generating potentially inlinable shortcuts to improve dynamic dispatch performance.
Wait, we already compile ahead of time (AOT). Why JIT?
AOT is great, but it works only on types known at compilation time, which is, as the name suggests, ahead of runtime. Quite often it is not possible to know the exact type of an object at that point, so the compiler generates extra code to check the type at runtime and look up the appropriate method to call. This is dynamic dispatch, a source of occasional slowness in Julia.
The generated code - although very fast and already optimized (with e.g. union splitting) - is sometimes far from optimal and can easily be sped up by giving hints to the compiler. This PoC shows how.
The example
Having differently typed objects in an array is a typical case. Something like:
abstract type A{T} end
struct C1{T} <: A{T} end
struct C2{T} <: A{T} end
const q = [C1{Int}(), C2{Int}(), C1{Real}()]
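Because the array mixes several concrete subtypes, its element type is the abstract A, so the compiler cannot know ahead of time which method applies to a given element. A quick, purely illustrative check:

isconcretetype(eltype(q)) # false: the element type is abstract
all(x -> x isa A, q)      # true: every element is some concrete subtype of A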
Now let’s say we have a function that we want to apply to every item in this array. For simplicity, just count the C1s and C2s separately with a dynamically dispatched function called count_subtypes.
const c1_count = Ref(0)
const c2_count = Ref(0)
reset() = (c1_count[] = 0; c2_count[] = 0)
count_subtypes(a::A) = nothing
count_subtypes(c1::C1) = (c1_count[] = c1_count[] + 1; nothing)
count_subtypes(c2::C2) = (c2_count[] = c2_count[] + 1; nothing)
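As a quick illustrative check (not part of the benchmark), applying it to the three-element array above increments the counters as expected:

foreach(count_subtypes, q)
c1_count[], c2_count[] # (2, 1): two C1 elements, one C2 element
reset()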
using BenchmarkTools
@btime foreach(count_subtypes, q)
Which results in 24.918 ns (0 allocations: 0 bytes) - very good for 3 dispatches!
The JIT compiler
Now let’s try to speed this up even more by generating the simplest possible “compiled” version of count_subtypes, which just checks at runtime whether the given argument is of a fixed type, say C1, and if so calls the original function with that knowledge, allowing (more) static dispatch. Otherwise, it just calls the original function with dynamic dispatch.
function compile(op, fixtype2)
    name = nameof(op)
    compiled_name = Symbol("_comp_$(nameof(op))")
    expr = quote
        @inline function $compiled_name(arg1::A)
            if arg1 isa $fixtype2
                # On this route the compiler knows the argument type, so the call can be devirtualized and inlined
                return $name(arg1)
            end
            return $name(arg1) # Same on both routes: JIT compilation is a no-op on the optimized code
        end
    end
    return eval(expr)
end
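For count_subtypes and C1, the expression evaluated by compile is roughly equivalent to this hand-written definition (shown only for illustration):

@inline function _comp_count_subtypes(arg1::A)
    if arg1 isa C1
        return count_subtypes(arg1) # only the C1 method can match here, so dispatch is static
    end
    return count_subtypes(arg1)     # fallback: dynamic dispatch, exactly as before
end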
The trick is that evaluating isa is much faster than spinning up the whole dispatch machinery:
julia> compile(count_subtypes, C1)
_comp_count_subtypes (generic function with 1 method)
julia> @btime foreach(_comp_count_subtypes, q)
13.557 ns (0 allocations: 0 bytes)
Great, but this was still ahead of time!
So let’s try to recompile the function and select the optimal shortcut:
const NUM_TYPES = 2*10
const QUEUE_LENGTH = 100

# Create an `Array{A}` with random `C1{Val{i}}` and `C2{Val{j}}` elements
function createq(alpha = 0.7, ql = QUEUE_LENGTH, nt = NUM_TYPES)
    @assert ql >= nt
    return [rand() < alpha ? C1{Val(i % nt)}() : C2{Val(i % nt)}() for i = 1:ql]
end
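As a quick, purely illustrative sanity check of createq (the q7 name exists only in this snippet):

q7 = createq(0.7)
length(q7)                                           # 100 elements
count(x -> x isa C1, q7) / count(x -> x isa C2, q7)  # roughly 0.7 / 0.3 ≈ 2.33 on average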
for alpha = 0.1:0.1:0.9
    print("\n$alpha: c1_count / c2_count goal: $(round((alpha / (1 - alpha)); digits=2)) Dynamic dispatch:")
    @btime foreach(count_subtypes, q) setup=(reset(); q=createq($alpha))

    # Based on empirical data, a sophisticated optimization decides what should run fast,
    # i.e. which type to handle on a separate, possibly inlined route
    fasttype = c1_count[] > c2_count[] ? C1 : C2
    print(" Measured ratio: $(round(c1_count[] / c2_count[]; digits=2)) => JIT on $fasttype:")
    compile(count_subtypes, fasttype)

    # Voila, 2.5-30x speedup
    @btime foreach(_comp_count_subtypes, q) setup=(reset(); q=createq($alpha))
end
Which outputs:
0.1: c1_count / c2_count goal: 0.11 Dynamic dispatch: 4.249 μs (0 allocations: 0 bytes)
Measured ratio: 0.06 => JIT on C2: 142.671 ns (0 allocations: 0 bytes)
0.2: c1_count / c2_count goal: 0.25 Dynamic dispatch: 4.355 μs (0 allocations: 0 bytes)
Measured ratio: 0.14 => JIT on C2: 431.571 ns (0 allocations: 0 bytes)
0.3: c1_count / c2_count goal: 0.43 Dynamic dispatch: 4.446 μs (0 allocations: 0 bytes)
Measured ratio: 0.43 => JIT on C2: 728.365 ns (0 allocations: 0 bytes)
0.4: c1_count / c2_count goal: 0.67 Dynamic dispatch: 4.477 μs (0 allocations: 0 bytes)
Measured ratio: 0.61 => JIT on C2: 1.068 μs (0 allocations: 0 bytes)
0.5: c1_count / c2_count goal: 1.0 Dynamic dispatch: 4.536 μs (0 allocations: 0 bytes)
Measured ratio: 0.89 => JIT on C2: 1.648 μs (0 allocations: 0 bytes)
0.6: c1_count / c2_count goal: 1.5 Dynamic dispatch: 4.564 μs (0 allocations: 0 bytes)
Measured ratio: 1.38 => JIT on C1: 1.052 μs (0 allocations: 0 bytes)
0.7: c1_count / c2_count goal: 2.33 Dynamic dispatch: 4.561 μs (0 allocations: 0 bytes)
Measured ratio: 1.94 => JIT on C1: 593.286 ns (0 allocations: 0 bytes)
0.8: c1_count / c2_count goal: 4.0 Dynamic dispatch: 4.537 μs (0 allocations: 0 bytes)
Measured ratio: 3.76 => JIT on C1: 412.985 ns (0 allocations: 0 bytes)
0.9: c1_count / c2_count goal: 9.0 Dynamic dispatch: 4.535 μs (0 allocations: 0 bytes)
Measured ratio: 11.5 => JIT on C1: 133.391 ns (0 allocations: 0 bytes)
Promising results! This could be usable in some real-world scenarios, but it is not clear where the limits are, and complex machinery may be needed to deploy it in existing programs, e.g.:
- Will it work for multiple arguments and complex types?
- What is the best way to overcome world age issues? (It seems that either the optimization must be run at top-level scope, or a “jit-context” type must be added to dispatch; one possible workaround is sketched after this list.)
- Need to collect data
- Handling multiple shortcuts
- Optimizing in TypeN
etc.
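On the world age question: a minimal, hedged sketch of one possible workaround (not used in the PoC above) is to call the freshly generated shortcut through Base.invokelatest, so it can be used from inside an already-running function:

function apply_with_fresh_shortcut(q)
    f = compile(count_subtypes, C1)          # eval defines the shortcut in a newer world
    return Base.invokelatest(foreach, f, q)  # invokelatest crosses the world-age barrier
end

Note that invokelatest has dispatch-like overhead of its own, so it should only be used to cross the barrier once, outside the hot loop.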
All of this seems to be more than what I can do in my spare time, so I am writing here in the hope that somebody finds this endeavour interesting enough to join!
The full code: https://gist.github.com/tisztamo/b1db3f385ac75288412ef41fa7157cac