Does Julia Create a "1.5" Language Problem?

In the spirit of Cunningham’s Law [1], I cobbled together a basic dynamic dispatch benchmark in Python, Julia and Rust [2].

Python Code
import itertools
import statistics
import timeit

class Letter:
    __slots__ = ("i",)
    def __init__(self, i):
        self.i = i

    def dostuff(self, x): ...

class A(Letter):
    def dostuff(self, x):
        return x + self.i

class B(Letter):
    def dostuff(self, x):
        return x + self.i

class C(Letter):
    def dostuff(self, x):
        return x + self.i

class D(Letter):
    def dostuff(self, x):
        return x + self.i

def bench(xs, i):
    res = i
    for x in xs:
        res = x.dostuff(res)
    return res

def show_results(results):
    print(f"min ... max: {min(results):.3f} ns ... {max(results):.3f} ns")
    print(f"median: {statistics.median(results):.3f} ns")
    print(f"mean +- stddev: {statistics.mean(results):.3f} ns +- {statistics.stdev(results):.3f} ns")

if __name__ == "__main__":
    items = [T(1) for T in itertools.islice(itertools.cycle([A, B, C, D]), 20)]
    nsamp, neval = 10000, 1000
    ns_multiplier = 1e9

    results = timeit.repeat("bench(items, 1)", repeat=nsamp, number=neval, globals=globals())
    show_results([result * ns_multiplier / neval for result in results])
    print()
    # Avoid small int cache
    results = timeit.repeat("bench(items, 100000)", repeat=nsamp, number=neval, globals=globals())
    show_results([result * ns_multiplier / neval for result in results])
Julia Code
using BenchmarkTools

abstract type Letter end

struct A <: Letter
    i::Int
end
struct B <: Letter
    i::Int
end
struct C <: Letter
    i::Int
end
struct D <: Letter
    i::Int
end

# > 3 methods to defeat the world-splitting heuristic (default max_methods = 3),
# so the call in the loop below stays a true dynamic dispatch; see the check after this block
dostuff(a::A, x) = x + a.i
dostuff(b::B, x) = x + b.i
dostuff(c::C, x) = x + c.i
dostuff(d::D, x) = x + d.i

function bench(xs, i)
    res = i
    for x in xs
        res = dostuff(x, res)
    end
    return res
end

let items = [T(1) for T in repeat([A, B, C, D], 5)]
    display(@benchmark bench($items, i) setup=(i=1))
    println()
    # Avoid small Int cache
    display(@benchmark bench($items, i) setup=(i=100000))
    println()
end
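
As an aside (my own check, not part of the benchmark itself): because the array's element type widens to the abstract Letter and more than max_methods methods match, the call in the loop is a genuine runtime dispatch that the compiler cannot union-split. Assuming the definitions above are loaded, one way to confirm:

# The comprehension's element type widens to Letter, so inference cannot
# resolve `dostuff` to a single method at compile time.
items = [T(1) for T in repeat([A, B, C, D], 5)]
@code_warntype bench(items, 1)
# In the output, `dostuff(x, res)` appears as a runtime call returning ::Any.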
Rust Code
use criterion::{black_box, criterion_group, criterion_main, Criterion};

trait Letter {
    fn dostuff(&self, x: isize) -> isize;
}
struct A {
    i: isize
}
struct B {
    i: isize
}
struct C {
    i: isize
}
struct D {
    i: isize
}
impl Letter for A {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}
impl Letter for B {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}
impl Letter for C {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}
impl Letter for D {
    fn dostuff(&self, x: isize) -> isize {
        x + self.i
    }
}

fn bench(xs: &[Box<dyn Letter>], i: isize) -> isize {
    let mut res = i;
    for x in xs {
        res = x.dostuff(res);
    }
    res
}

fn criterion_benchmark(c: &mut Criterion) {
    let mut xs: Vec<Box<dyn Letter>> = Vec::new();
    for i in 0..20 {
        match i % 4 {
            0 => xs.push(Box::new(A { i: 1 })),
            1 => xs.push(Box::new(B { i: 1 })),
            2 => xs.push(Box::new(C { i: 1 })),
            3 => xs.push(Box::new(D { i: 1 })),
            _ => unreachable!(),
        }
    }
    c.bench_function("dynamic dispatch with virtual methods", |b| b.iter(|| {
        let i = black_box(100000);
        bench(black_box(&xs), i)
    }));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Results:

> python bench.py 
min ... max: 1156.847 ns ... 2555.920 ns
median: 1161.976 ns
mean +- stddev: 1165.322 ns +- 36.903 ns

min ... max: 1304.791 ns ... 2294.531 ns
median: 1360.398 ns
mean +- stddev: 1364.048 ns +- 19.811 ns
> julia bench.jl
BenchmarkTools.Trial: 10000 samples with 251 evaluations.
 Range (min … max):  299.191 ns … 504.641 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     300.422 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   302.278 ns ±  10.312 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇▇█                                                           ▂
  ███▇▃▅▄▇█▆▆▅▅▅▅▅▅▆▆▅▆▅▆▇▄▅▅▅▅▃▆▅▅▅▅▅▅▅▄▅▄▆▅▄▄▅▃▅▅▄▄▅▄▃▄▄▄▁▄▄▅ █
  299 ns        Histogram: log(frequency) by time        359 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

BenchmarkTools.Trial: 10000 samples with 202 evaluations.
 Range (min … max):  386.480 ns …  2.449 μs  ┊ GC (min … max): 0.00% … 77.01%
 Time  (median):     398.871 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   408.175 ns ± 72.037 ns  ┊ GC (mean ± σ):  0.58% ±  2.81%

  ▁    ██▆▄▂▁                                       ▁▁         ▁
  █▇▆▅███████▆▇▆█▇▆▇▆▆█▇▇▅▅▄▅▄▅▅▅▅▅▆▇█▇▇▆▅▅▄▄▂▅▅▅▇█████▇▅▄▂▆▅▇ █
  386 ns        Histogram: log(frequency) by time       506 ns <

 Memory estimate: 336 bytes, allocs estimate: 21.
> cargo bench
...
     Running benches/bench_dyn_dispatch.rs (target/release/deps/bench_dyn_dispatch-baccd544aa3dd6a1)
Benchmarking dynamic dispatch with virtual methods: Collecting 100 samples in estimated 5.0 s
dynamic dispatch with virtual methods
                        time:   [27.381 ns 27.393 ns 27.409 ns]
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) high mild
  14 (14.00%) high severe

Note that there are two benchmark runs for Python and Julia because both languages employ small integer caching. This lets us disentangle some of the overhead related to boxing arguments for dynamic dispatch from everything else.
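
As a quick sanity check of the Julia side (a minimal sketch of my own, assuming a stock Int64 build), @allocated makes the small-Int cache visible directly; this is likely where the 336 bytes / 21 allocations in the second Julia run come from, roughly one boxed Int per dispatch:

# Boxing 1 hits the interned small-Int cache; boxing 100000 heap-allocates a fresh box.
box(x) = Ref{Any}(x)             # Ref{Any} forces x to be boxed
box(0)                           # compile first
println(@allocated box(1))       # only the Ref wrapper itself
println(@allocated box(100000))  # the Ref wrapper plus a freshly boxed Int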

So at least on this example, untyped Julia is quite a bit faster. I genuinely tried multiple permutations of the code to reduce Python-side overhead and make things less advantageous for Julia, but the gap remained. I’m sure there are real-world examples showing the opposite, but I couldn’t think of where to start there. If anyone has experienced this, here’s your incentive to prove me wrong!

More interesting, IMO, is the comparison between Julia and Rust. Even when allocation overhead is eliminated by sticking to small integers, single dispatch with vtables in Rust is an order of magnitude faster. What’s surprising is that a perf gap exists despite the optimizations Julia has for method lookup. Quoting from Julia Functions · The Julia Language:

The next issue concerns the structure of method cache hash tables. Empirical studies show that the vast majority of dynamically-dispatched calls involve one or two arguments. In turn, many of these cases can be resolved by considering only the first argument. (Aside: proponents of single dispatch would not be surprised by this at all. However, this argument means “multiple dispatch is easy to optimize in practice”, and that we should therefore use it, not “we should use single dispatch”!) So the method cache uses the type of the first argument as its primary key. Note, however, that this corresponds to the second element of the tuple type for a function call (the first element being the type of the function itself). Typically, type variation in head position is extremely low – indeed, the majority of functions belong to singleton types with no parameters. However, this is not the case for constructors, where a single method table holds constructors for every type. Therefore the Type method table is special-cased to use the first tuple type element instead of the second.
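
For concreteness (my own illustration, not from the quoted docs): the "tuple type for a function call" mentioned above includes the function's own type as its first element, and it is exactly what reflection utilities like which accept:

# The signature Julia dispatches on: the function's type first, then the arguments.
sig = Tuple{typeof(dostuff), A, Int}
which(sig)   # returns the method dostuff(a::A, x)

For a call like dostuff(items[1], 1), the primary cache key described in the quote is therefore A, the second element of this tuple.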

One can see the exact implementation of this method cache hash table and the rest of the method lookup logic in julia/src/gf.c at v1.10.0 · JuliaLang/julia · GitHub.

Thus my follow-up question is: why are we so much slower? Is it the speed of method lookup, or something else? Is there any low-hanging fruit that could be optimized here?
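
I haven’t run this to ground myself, but as a starting point (a sketch, assuming the benchmark definitions above are loaded), profiling with C frames enabled should show how much of the time sits in the runtime’s generic-call machinery from src/gf.c, e.g. jl_apply_generic:

using Profile

items = [T(1) for T in repeat([A, B, C, D], 5)]
bench(items, 100000)             # compile first
Profile.clear()
@profile for _ in 1:10^7
    bench(items, 100000)
end
Profile.print(C = true)          # C frames expose jl_apply_generic and friends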


  1. “The best way to get the right answer on the Internet is not to ask a question; it’s to post the wrong answer.” ↩︎

  2. Why Rust? Unlike Java, it’s easy to verify that the compiler is not devirtualizing the calls being made here. And unlike C, I can actually write correct Rust code :stuck_out_tongue:. ↩︎
