The programmer attractivechaos has created a new benchmark comparing different programming languages, hosted at GitHub - attractivechaos/plb2: A programming language benchmark. It’s a quite nice benchmark, being small enough that it’s easy to dig into and microoptimise, while also being large enough that the differences between programming languages are not completely trivial.
Moreover, it happens to benchmark exactly the kind of problems that Julia ought to excel at!
Currentlly, Julia does quite well, and is only beaten by the statically compiled languages. However, Julia is still 47% slower than C - 8.09 vs 5.51 seconds. Startup time only accounts for about 1 to 1.5 seconds.
Given that we pride ourselves on our performance, and that these benchmarks mostly come down to simple array operations which should be essentially identical in C and Julia, I think it could be interesting to dig into where this performance difference come from. Perhaps it’s more representative that we would want to believe i.e, perhaps we can’t just expect Julia to match C even on simple array operations.
I’ve done a little digging myself and found some interesting differences from Julia to C:
Startup time
This only takes about 1 to 1.5 seconds in total for the 4 runs. On my computer this is improved by about 125 miliseconds from Julia 1.10 to Julia 1.11. So, while this is a real difference from C, it only accounts for a minority of the speed gap between Julia and C of 2.6 seconds.
Interestingly, when running the Julia scripts from command line in a loop, the runtime decreases steadily the first few runs, only stabilizing around the 4th run. I wonder why that is - perhaps Julia, on startup, accesses multiple files that need to be in the filesystem cache? Is it a Juliaup thing? This effect is quite significant - around there is a difference of around 1.6 seconds between the first and fourth run, 50x larger than the startup speed improvements in Julia 1.11. So perhaps something to look at there.
Memory allocation
Julia of course uses garbage collection. The reported time spent in garbage collection is small (less than 1% for the Sudoku benchmark), but the impact of Julia’s memory model is much larger. Specifically, when a function that allocates is called in a loop, the memory is not able to be re-used between the calls as it can be in languages with deterministic destruction. This severely impacts cache locality.
Moving the allocations outside the loop provides a significant speedup, even when the actual time taken for GC and allocating the arrays is small. Perhaps the ongoing (stalled?) work on escape analysis could allow Julia to reuse memory in the future.
Static arrays
C and Rust both provide statically sized arrays in the core language, and I also believe they are able to stack allocate these. Julia provides them through the StaticArrays package, but they are always backed by heap storage.
Statically knowing the array sizes makes Julia generate faster code. It may also interact with the memory allocation above - perhaps the static languages are able to allocate a fixed slab of memory once in the beginning of the program for these static arrays?
I wonder if it makes sense to have a low-level statically sized mutable buffer in the language, sort of like Memory{T}
, but with a compile-time known length - Memory{T,N}
?
Complicated bitshifts
Perhaps surprisingly, one of the benchmarks run 8% slower in Julia because Julia’s implementation of bitshifts don’t map directly onto the assembly instructions, but have special handling for negative bitshifts and bitshifts larger than the bitwise of the operand.
I’ve also found this to be a surprisingly large performance issue in my own code in BioSequences.jl and Kmers.jl
In contrast, C’s bitshifts correspond to the assembly instruction, and Rust provides explicitly wrapping bitshifts.
We do not currently have this in Julia - however, Keno suggested adding explicitly wrapping operations like +%
to Julia. Perhaps this could be extended to e.g. >>%
?
Slow push!
and pop!
One of the benchmarks is significantly slowed down by the fact that Array
is implemented in C, which means that much of its internal workings is written in C and can’t be inlined into Julia code. Notably, push!
and pop!
is slow in Julia. Thanks to work by Jameson Nash, Oscar Smith and others, this has been addressed for Julia 1.11, where this benchmark sees an important speedup.
Still more differences?
Unfortunately, all of these points don’t fully explain the performance gap from C to Julia. Even when I address all of them by not timing startup, using StaticArrays, using Julia 1.11, etc., C still runs faster. It also generates smaller assembly code, though I’m not able to understand from reading the assembly where these extra instructions come from. I’d be interested in hearing more.