I don’t know how true this is; the author certainly didn’t share a benchmark. Theoretically, dispatching on a tuple of multiple types is more complicated than dispatching on a single type; on top of that, AOT compilers know all of the concrete types that the program can dispatch to. But it’s actually extremely difficult to benchmark the runtime dispatch itself without getting thrown off by nearby code (like the function being called). Python is also not AOT-compiled, and its single dispatch is not a fixed vtable lookup, but a search through dynamically sized dictionaries.
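Roughly, and glossing over __getattribute__ overrides, __slots__, and CPython's per-type attribute cache, that lookup is a walk over plain dicts. Here is a minimal sketch; the lookup helper is my own illustration, not CPython's actual algorithm in full:

_MISSING = object()

def lookup(obj, name):
    # Find the attribute in the class or one of its bases, in MRO order.
    attr = _MISSING
    for klass in type(obj).__mro__:
        if name in klass.__dict__:  # each __dict__ is a plain, resizable dict
            attr = klass.__dict__[name]
            break
    # Data descriptors (e.g. properties) take priority over the instance dict.
    if attr is not _MISSING and hasattr(type(attr), "__set__"):
        return type(attr).__get__(attr, obj, type(obj))
    # Then the instance's own dict.
    instance_dict = getattr(obj, "__dict__", {})
    if name in instance_dict:
        return instance_dict[name]
    # Otherwise fall back to non-data descriptors (plain functions
    # implement __get__) or the raw class attribute.
    if attr is not _MISSING and hasattr(type(attr), "__get__"):
        return type(attr).__get__(attr, obj, type(obj))
    if attr is not _MISSING:
        return attr
    raise AttributeError(name)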
A couple of weeks ago, I benchmarked a dynamic dispatch of + (or something whose LLVM IR seems close enough to it, anyway). It added 14.6ns to a 2ns integer addition; function barriers are incredible for flexibility and interactivity, but we don’t want to go through them this frequently unless each call takes much, much longer.
I tried to do something similar in Python, but I wasn’t sure how to benchmark the dispatch without including the call overhead. So I just benchmarked accessing the bound method, prior to any call:
>>> class A:
...     def foo(self): pass
...
>>> a = A()
>>> import timeit
>>> timeit.timeit("a.foo", globals=globals(), number=1000000)  # seconds
0.17925379995722324
>>> timeit.timeit("a", globals=globals(), number=1000000)
0.04503010003827512
So that’s 179ns per method access, or 134ns once we subtract the cost of looking up a itself. That seems like a lot, so I checked a plainer dictionary access:
>>> timeit.timeit("a['foo']", setup="a={'foo':print}", number=1000000)
0.07713829993735999
So 77.1ns, which doesn’t match. Then I remembered that each method access instantiates a bound method forwarding to the actual function in the class. That can’t be optimized away at (bytecode-)compile time either, because foo can be reassigned in the class or on the instance at any time.
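Both effects are easy to see in a REPL; a quick sketch, reusing A and a from above:

>>> a.foo is a.foo  # each access builds a fresh bound method object
False
>>> a.foo = lambda: 1  # foo can be shadowed on the instance...
>>> a.foo()
1
>>> del a.foo
>>> A.foo = lambda self: 2  # ...or reassigned in the class
>>> a.foo()
2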
Of course, this doesn’t address how the performance scales, and that’s hard to guess without knowing what the dispatch actually does. But the speed of even this fraction of CPython’s single dispatch seems greatly exaggerated.
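If I wanted to probe the scaling, one crude starting point might be timing the same access through increasingly deep inheritance chains. This is purely illustrative, and CPython's per-type method cache may blunt any depth effect:

import timeit

# Time attribute access through increasingly deep inheritance chains;
# foo lives on the root class, so the lookup walks the whole MRO.
cls = type("C0", (), {"foo": lambda self: None})
for depth in range(1, 101):
    cls = type(f"C{depth}", (cls,), {})
    if depth in (1, 10, 100):
        obj = cls()
        t = timeit.timeit("obj.foo", globals={"obj": obj}, number=1000000)
        print(f"MRO length {depth + 2}: {t:.3f}s per million accesses")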