It’s not actually easy to separate the timing of runtime dispatches from the surrounding code, and I wasn’t much convinced by the few benchmarks people have tried, even the ones that did control the number of dispatches. I can’t just time jl_apply_generic either, because a runtime dispatch also involves preparing the call.
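To make “preparing a call” concrete, here’s a toy sketch of my own (not the benchmark code): when the callee can’t be resolved statically, the emitted code sets up an argument frame and calls jl_apply_generic instead of any particular method.
using InteractiveUtils # for @code_llvm outside the REPL
# ref[]::Any forces a runtime dispatch at sin(ref[])
unstable(ref::Ref{Any}) = sin(ref[])
@code_llvm unstable(Ref{Any}(1.0)) # the IR contains a call to jl_apply_generic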
Making the type and 6×14×14 methods
# Julia v1.11.2
struct Call2arg{F,X,Y}
    f::F
    x::X
    y::Y
end
@inline (c::Call2arg)() = c.f(c.x, c.y)
# contained source of poor type inference
const Call2num = Call2arg{Function, Number, Number}
funcs, inputs = let
    funcnames = Symbol.(collect("qwerty"))
    numtypes = [Bool, Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64,
                UInt64, Int128, UInt128, Float16, Float32, Float64]
    methodcount = 0
    for func in funcnames
        for type1 in numtypes
            for type2 in numtypes
                methodcount += 1
                @eval $func(::$(Symbol(type1)), ::$(Symbol(type2))) = $methodcount
            end
        end
    end
    @show methodcount
    eval.(funcnames), one.(numtypes)
end
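As a sanity check, each generated method just returns its definition order; these values follow from the construction above:
julia> q(true, true) # first method defined
1
julia> r(1, 2.2) # the combination the benchmarks below keep hitting
700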
Attempt 1: a simple benchmark loop. The @code_llvm reports suggest the type-unstable call just loads up a call frame and calls jl_apply_generic, so I’m fairly confident the only extraneous code is the calls themselves, which all just return an Int.
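(For reproduction, the reports came from something like this:)
julia> c_unstable = Call2num(r, 1, 2.2); c_stable = Call2arg(r, 1, 2.2);
julia> @code_llvm c_unstable() # frame setup and a call to jl_apply_generic
julia> @code_llvm c_stable()   # `r` inlines away to the constant 700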
julia> using BenchmarkTools
julia> @btime $(Call2num(r, 1, 2.2))() # loop of runtime dispatch
11.311 ns (0 allocations: 0 bytes)
700
julia> @btime $(Call2arg(r, 1, 2.2))(); # loop of loading `Int`
1.000 ns (0 allocations: 0 bytes)
which suggests runtime dispatch adds ~10ns on my system (11.311ns minus the ~1ns type-stable baseline).
Attempt 2: I tried to use a setup and evals=1 to randomize the call before each timing, but the weird results reminded me that so few evaluations per timed sample cannot accurately resolve such short timings.
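For reference, the attempt looked roughly like this sketch (reconstructed, so the details may differ); the reported minimums jumped around from run to run:
julia> @btime c() setup=(c = Call2num(rand(funcs), rand(inputs), rand(inputs))) evals=1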
Attempt 3: I timed a loop of randomized calls so it lasts long enough for a decent measurement, moving as much of the randomization work as possible outside the benchmark.
julia> function loopcalls(randcalls)
           local x
           for i in eachindex(randcalls)
               x = randcalls[i]()
           end
           x
       end;
julia> randcalls = let n = 1_000_000
           Call2num.(rand(funcs, n), rand(inputs, n), rand(inputs, n))
       end;
julia> @btime loopcalls($randcalls)
179.658 ms (0 allocations: 0 bytes)
258
It was hard to write a runtime-dispatch-free version whose loop didn’t get optimized away, so I had to settle for a non-equivalent loop that loads an Int64 field.
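For the record, the straightforward dispatch-free analogue looked like this sketch; as far as I can tell every call inlines to the constant 700, leaving the loop nothing to do, so it times as if the million calls were free:
julia> stablecalls = Call2arg.(r, rand(Int64, 1_000_000), rand(Float64, 1_000_000));
julia> @btime loopcalls($stablecalls) # the loop is optimized away
Hence the field-loading substitute below.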
julia> function loopblah(randcalls)
           local x
           for i in eachindex(randcalls)
               x = randcalls[i].x
           end
           x
       end;
julia> randcalls = let n = 1_000_000
           Call2arg.(r, rand(Int64, n), rand(Float64, n))
       end;
julia> @btime loopblah($randcalls)
32.100 μs (0 allocations: 0 bytes)
193604991833945415
The @code_llvm printout diverged much more, so I’m not sure this isolated the iteration overhead, but assuming it really is this negligible (32μs spread over a million iterations), randomized runtime dispatch of these methods adds ~180ns per call. To check that the randomization is responsible for the difference between 10ns and 180ns, the 3rd benchmark attempt can be redone to resemble the 1st:
julia> randcalls = let n = 1_000_000
           Call2num.(Ref(r), rand(Int, n), rand(Float64, n))
       end;
julia> @btime loopcalls($randcalls)
11.154 ms (0 allocations: 0 bytes)
700
and we get the expected ~11ns-per-call performance. I don’t actually know why the same code and method tables behave so differently, but CPU caching has been suggested. However, omitting some functions and inputs from the random sampling (same million calls) seems to smoothly decrease the runtime, and I don’t know how caches would explain that.
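I didn’t keep those runs, but the sampling looked like this sketch, drawing the same million calls from smaller slices of the pools (the slice sizes here are arbitrary):
julia> randcalls = let n = 1_000_000
           Call2num.(rand(funcs[1:2], n), rand(inputs[1:4], n), rand(inputs[1:4], n))
       end;
julia> @btime loopcalls($randcalls) # decreases steadily as the slices shrink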
You probably saw this coming, but the motivation was a comparison with CPython’s single dispatch; you might have heard that CPython recently gained a JIT compiler, but the available binaries aren’t built with it, so all the dispatches here happen at runtime.
>>> class A: # v3.13.0
...     def foo(self): return 1
...
>>> a = A()
>>> import timeit
>>> timeit.timeit("a.foo()", globals=globals(), number = 1_000_000) # seconds
0.03291300009004772
which adds ~33ns, unrandomized.
Python doesn’t have numeric primitives or metaprogramming like Julia’s, so I can’t make an equivalent of the 2-argument dispatch, and there’s no compiler to optimize away the attribute accesses, but I tried to generate more heavily monkeypatched classes to make up for it.
funcnames = "qwerty"
halfclasses = "ZXCVBNMASDFGHJ"  # 14*14 = 196 two-letter classes
methodcount = 0
inputs = []
for halfclass1 in halfclasses:
    for halfclass2 in halfclasses:
        exec(f"""
class {halfclass1}{halfclass2}: pass
""")
        inputs.append(eval(f"{halfclass1}{halfclass2}")())
        for func in funcnames:  # 6 methods per class, 1176 total
            methodcount += 1
            exec(f"""
def {func}(self): return {methodcount}
{halfclass1}{halfclass2}.{func} = {func}
""")
import random
n = 1_000_000
randinputs = random.choices(inputs, k=n)
randfuncs = random.choices(funcnames, k=n)
def loopcalls(randfuncs, randinputs):
    for i in range(len(randfuncs)):
        x = getattr(randinputs[i], randfuncs[i])()
    return x
def loopblah(randfuncs, randinputs):
    for i in range(len(randfuncs)):
        x = 1
    return x
import timeit
>>> timeit.timeit("loopcalls(randfuncs,randinputs)", globals=globals(), number = 1)
0.16309330007061362
>>> timeit.timeit("loopblah(randfuncs,randinputs)", globals=globals(), number = 1)
0.018459899933077395
which adds ~145ns per call, randomized ((0.163s − 0.018s) over a million calls). When I omit methods and classes, the runtime decreases too, though going to the extreme of a single class and method only gets down to an added ~64ns; it doesn’t match the 1-call benchmark’s ~33ns.
>>> randinputs = random.choices(inputs[:1], k=n)
>>> randfuncs = random.choices(funcnames[:1], k=n)
>>> timeit.timeit("loopcalls(randfuncs,randinputs)", globals=globals(), number = 1)
0.09056359995156527
>>> timeit.timeit("loopblah(randfuncs,randinputs)", globals=globals(), number = 1)
0.026401199982501566
So while you’d expect single dispatch to be more efficient than multiple dispatch, Python’s dynamism demands a costlier implementation. This 1176-method benchmark doesn’t show a significant difference (33-145ns vs 10-180ns), but the number of recently called methods clearly affects the runtime, and for all I know the two could scale very differently over that or other factors.