In the spirit of Cunningham’s Law [1], I cobbled together a basic dynamic dispatch benchmark in Python, Julia and Rust [2].
Python Code
import itertools
import statistics
import timeit


class Letter:
    __slots__ = ("i",)

    def __init__(self, i):
        self.i = i

    def dostuff(self, x): ...


class A(Letter):
    def dostuff(self, x):
        return x + self.i


class B(Letter):
    def dostuff(self, x):
        return x + self.i


class C(Letter):
    def dostuff(self, x):
        return x + self.i


class D(Letter):
    def dostuff(self, x):
        return x + self.i


def bench(xs, i):
    res = i
    for x in xs:
        res = x.dostuff(res)
    return res


def show_results(results):
    print(f"min ... max: {min(results):.3f} ns ... {max(results):.3f} ns")
    print(f"median: {statistics.median(results):.3f} ns")
    print(f"mean +- stddev: {statistics.mean(results):.3f} ns +- {statistics.stdev(results):.3f} ns")


if __name__ == "__main__":
    items = [T(1) for T in itertools.islice(itertools.cycle([A, B, C, D]), 20)]
    nsamp, neval = 10000, 1000
    ns_multiplier = 1e9
    results = timeit.repeat("bench(items, 1)", repeat=nsamp, number=neval, globals=globals())
    show_results([result * ns_multiplier / neval for result in results])
    print()
    # Avoid small int cache
    results = timeit.repeat("bench(items, 100000)", repeat=nsamp, number=neval, globals=globals())
    show_results([result * ns_multiplier / neval for result in results])
Julia Code
using BenchmarkTools

abstract type Letter end

struct A <: Letter
    i::Int
end

struct B <: Letter
    i::Int
end

struct C <: Letter
    i::Int
end

struct D <: Letter
    i::Int
end

# > 3 methods to disable world-splitting heuristic (max_methods)
dostuff(a::A, x) = x + a.i
dostuff(b::B, x) = x + b.i
dostuff(c::C, x) = x + c.i
dostuff(d::D, x) = x + d.i

function bench(xs, i)
    res = i
    for x in xs
        res = dostuff(x, res)
    end
    return res
end

let items = [T(1) for T in repeat([A, B, C, D], 5)]
    display(@benchmark bench($items, i) setup=(i=1))
    println()
    # Avoid small Int cache
    display(@benchmark bench($items, i) setup=(i=100000))
    println()
end
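As a quick sanity check on the max_methods comment above: with four applicable methods, inference gives up on the dostuff call site, so the calls really do go through dynamic dispatch. One way to confirm this from the REPL (a sketch, assuming the definitions above are loaded):

# With more than max_methods (3 by default) matching methods, inference
# cannot union-split the dostuff call, so res is inferred as Any and every
# call is dispatched at runtime.
items = [T(1) for T in repeat([A, B, C, D], 5)]
@code_warntype bench(items, 1)   # look for res::Any and a runtime dostuff call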
Rust Code
use criterion::{black_box, criterion_group, criterion_main, Criterion};

trait Letter {
    fn dostuff(&self, x: isize) -> isize;
}

struct A {
    i: isize,
}

struct B {
    i: isize,
}

struct C {
    i: isize,
}

struct D {
    i: isize,
}

impl Letter for A {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}

impl Letter for B {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}

impl Letter for C {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}

impl Letter for D {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}

fn bench(xs: &[Box<dyn Letter>], i: isize) -> isize {
    let mut res = i;
    for x in xs.iter() {
        res = x.dostuff(res);
    }
    res
}

fn criterion_benchmark(c: &mut Criterion) {
    let mut xs: Vec<Box<dyn Letter>> = Vec::new();
    for i in 0..20 {
        match i % 4 {
            0 => xs.push(Box::new(A { i: 1 })),
            1 => xs.push(Box::new(B { i: 1 })),
            2 => xs.push(Box::new(C { i: 1 })),
            3 => xs.push(Box::new(D { i: 1 })),
            _ => unreachable!(),
        }
    }
    c.bench_function("dynamic dispatch with virtual methods", |b| {
        b.iter(|| {
            let i = black_box(100000);
            bench(black_box(&xs), i)
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
Results:
> python bench.py
min ... max: 1156.847 ns ... 2555.920 ns
median: 1161.976 ns
mean +- stddev: 1165.322 ns +- 36.903 ns
min ... max: 1304.791 ns ... 2294.531 ns
median: 1360.398 ns
mean +- stddev: 1364.048 ns +- 19.811 ns
> julia bench.jl
BenchmarkTools.Trial: 10000 samples with 251 evaluations.
Range (min … max): 299.191 ns … 504.641 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 300.422 ns ┊ GC (median): 0.00%
Time (mean ± σ): 302.278 ns ± 10.312 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇▇█ ▂
███▇▃▅▄▇█▆▆▅▅▅▅▅▅▆▆▅▆▅▆▇▄▅▅▅▅▃▆▅▅▅▅▅▅▅▄▅▄▆▅▄▄▅▃▅▅▄▄▅▄▃▄▄▄▁▄▄▅ █
299 ns Histogram: log(frequency) by time 359 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 202 evaluations.
Range (min … max): 386.480 ns … 2.449 μs ┊ GC (min … max): 0.00% … 77.01%
Time (median): 398.871 ns ┊ GC (median): 0.00%
Time (mean ± σ): 408.175 ns ± 72.037 ns ┊ GC (mean ± σ): 0.58% ± 2.81%
▁ ██▆▄▂▁ ▁▁ ▁
█▇▆▅███████▆▇▆█▇▆▇▆▆█▇▇▅▅▄▅▄▅▅▅▅▅▆▇█▇▇▆▅▅▄▄▂▅▅▅▇█████▇▅▄▂▆▅▇ █
386 ns Histogram: log(frequency) by time 506 ns <
Memory estimate: 336 bytes, allocs estimate: 21.
> cargo bench
...
Running benches/bench_dyn_dispatch.rs (target/release/deps/bench_dyn_dispatch-baccd544aa3dd6a1)
Benchmarking dynamic dispatch with virtual methods: Collecting 100 samples in estimated 5.0 s
dynamic dispatch with virtual methods
                        time:   [27.381 ns 27.393 ns 27.409 ns]
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) high mild
  14 (14.00%) high severe
Note that there are two benchmark runs for Python and Julia because both languages employ small integer caching. This lets us disentangle some of the overhead related to boxing arguments for dynamic dispatch from everything else.
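For the Julia side, the boxing effect is easy to see directly: results that stay inside the boxed-Int cache keep the dynamic calls allocation-free, while larger values force a heap allocation per boxed return. A minimal sketch, assuming the definitions from the Julia listing above are loaded:

# Each dynamically dispatched call must box its Int result. Small Ints come
# from a preallocated cache; larger values are heap-allocated on every call,
# matching the "336 bytes, allocs estimate: 21" in the results above.
items = [T(1) for T in repeat([A, B, C, D], 5)]
bench(items, 1); bench(items, 100000)   # compile first
@show @allocated bench(items, 1)        # expect 0
@show @allocated bench(items, 100000)   # expect 336 (21 boxed Ints x 16 bytes)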
So at least in this example, untyped Julia is quite a bit faster. I genuinely tried multiple permutations of the code to reduce Python-side overhead and make things less advantageous for Julia, but the gap remained. I'm sure there are real-world examples showing the opposite, but I couldn't think of where to start. If anyone has experience with this, here's your incentive to prove me wrong!
More interesting, IMO, is the comparison between Julia and Rust. Even when allocation overhead is eliminated by sticking to small integers, single dispatch through vtables in Rust is an order of magnitude faster. What's surprising is that the gap exists despite the optimizations Julia has for method lookup. Quoting from Julia Functions · The Julia Language:
The next issue concerns the structure of method cache hash tables. Empirical studies show that the vast majority of dynamically-dispatched calls involve one or two arguments. In turn, many of these cases can be resolved by considering only the first argument. (Aside: proponents of single dispatch would not be surprised by this at all. However, this argument means "multiple dispatch is easy to optimize in practice", and that we should therefore use it, not "we should use single dispatch"!) So the method cache uses the type of the first argument as its primary key. Note, however, that this corresponds to the second element of the tuple type for a function call (the first element being the type of the function itself). Typically, type variation in head position is extremely low – indeed, the majority of functions belong to singleton types with no parameters. However, this is not the case for constructors, where a single method table holds constructors for every type. Therefore the Type method table is special-cased to use the first tuple type element instead of the second.
One can see the exact implementation of this hash table and the rest of the method lookup logic in julia/src/gf.c at v1.10.0 · JuliaLang/julia · GitHub.
Thus my follow-up question is why Julia is so much slower here. Is it the speed of method lookup itself, or something else? Is there any low-hanging fruit that could be optimized?
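One way to probe this from the user side is manual union splitting: branching on the concrete type turns each call into a static dispatch, so whatever time remains is not lookup. A rough sketch under the definitions above (bench_split is a name I made up; it is not part of the original benchmark):

# Manually union-split the dispatch: each branch sees a concrete argument
# type, so the compiler can devirtualize dostuff. The gap between this and
# bench approximates the cost of dynamic method lookup.
function bench_split(xs, i)
    res = i
    for x in xs
        res = x isa A ? dostuff(x, res) :
              x isa B ? dostuff(x, res) :
              x isa C ? dostuff(x, res) :
                        dostuff(x::D, res)
    end
    return res
end
# e.g.: display(@benchmark bench_split($items, i) setup=(i=100000))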