Programming Language Benchmark 2

The programmer attractivechaos has created a new benchmark comparing different programming languages, hosted at GitHub - attractivechaos/plb2: A programming language benchmark. It’s quite a nice benchmark: small enough that it’s easy to dig into and micro-optimise, while still large enough that the differences between programming languages are not completely trivial.
Moreover, it happens to benchmark exactly the kind of problems that Julia ought to excel at!

Currently, Julia does quite well, and is only beaten by the statically compiled languages. However, Julia is still 47% slower than C - 8.09 vs 5.51 seconds. Startup time only accounts for about 1 to 1.5 seconds.
Given that we pride ourselves on our performance, and that these benchmarks mostly come down to simple array operations which should be essentially identical in C and Julia, I think it could be interesting to dig into where this performance difference comes from. Perhaps it’s more representative than we would want to believe, i.e. perhaps we can’t just expect Julia to match C even on simple array operations.

I’ve done a little digging myself and found some interesting differences between Julia and C:

Startup time

This only takes about 1 to 1.5 seconds in total for the 4 runs. On my computer this is improved by about 125 milliseconds from Julia 1.10 to Julia 1.11. So, while this is a real difference from C, it accounts for only a minority of the 2.6-second speed gap between Julia and C.

Interestingly, when running the Julia scripts from the command line in a loop, the runtime decreases steadily over the first few runs, only stabilizing around the 4th run. I wonder why that is - perhaps Julia, on startup, accesses multiple files that need to be in the filesystem cache? Is it a Juliaup thing? This effect is quite significant: there is a difference of around 1.6 seconds between the first and fourth run, more than ten times larger than the startup-time improvement in Julia 1.11. So perhaps something to look at there.
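For anyone who wants to reproduce this, one way to measure it (assuming hyperfine is installed; the --warmup flag discards the initial cache-cold runs before timing):

hyperfine --runs 10 "julia --startup-file=no -e ''"
hyperfine --warmup 4 --runs 10 "julia --startup-file=no -e ''"

Comparing the two should isolate the cache-warming effect from the baseline startup cost.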

Memory allocation

Julia of course uses garbage collection. The reported time spent in garbage collection is small (less than 1% for the Sudoku benchmark), but the impact of Julia’s memory model is much larger. Specifically, when a function that allocates is called in a loop, the memory cannot be reused between calls the way it can be in languages with deterministic destruction. This severely impacts cache locality.

Moving the allocations outside the loop provides a significant speedup, even when the actual time taken for GC and allocating the arrays is small. Perhaps the ongoing (stalled?) work on escape analysis could allow Julia to reuse memory in the future.
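As a minimal toy sketch of the pattern (illustrative, not one of the benchmark programs):

function loop_alloc(n, k)
    s = 0.0
    for _ in 1:n
        buf = zeros(k)   # a fresh heap allocation on every iteration
        s += sum(buf)
    end
    s
end

function loop_hoisted(n, k)
    s = 0.0
    buf = zeros(k)       # allocated once and reused; stays hot in cache
    for _ in 1:n
        fill!(buf, 0.0)
        s += sum(buf)
    end
    s
end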

Static arrays

C and Rust both provide statically sized arrays in the core language, and I also believe they are able to stack allocate these. Julia provides them through the StaticArrays package, but they are always backed by heap storage.
Statically knowing the array sizes lets Julia generate faster code. It may also interact with the memory allocation issue above - perhaps the static languages are able to allocate a fixed slab of memory once at the beginning of the program for these static arrays?
I wonder if it makes sense to have a low-level statically sized mutable buffer in the language, sort of like Memory{T}, but with a compile-time known length - Memory{T,N}?
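For illustration, something in this direction can be emulated today with a mutable struct wrapping an NTuple (FixedBuffer is a hypothetical name for this sketch, and it is roughly the trick StaticArrays uses for MArray):

mutable struct FixedBuffer{T,N}
    data::NTuple{N,T}
end
FixedBuffer{T,N}() where {T,N} = FixedBuffer{T,N}(ntuple(_ -> zero(T), Val(N)))

Base.length(::FixedBuffer{T,N}) where {T,N} = N
Base.getindex(b::FixedBuffer, i::Int) = b.data[i]
function Base.setindex!(b::FixedBuffer{T,N}, v, i::Int) where {T,N}
    # Base.setindex on a tuple returns a new tuple; the compiler usually
    # turns this field update into an in-place store.
    b.data = Base.setindex(b.data, convert(T, v), i)
    b
end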

Complicated bitshifts

Perhaps surprisingly, one of the benchmarks runs 8% slower in Julia because Julia’s implementation of bitshifts doesn’t map directly onto the assembly instructions, but has special handling for negative shifts and shifts larger than the bit width of the operand.
I’ve also found this to be a surprisingly large performance issue in my own code in BioSequences.jl and Kmers.jl.
In contrast, C’s bitshifts correspond to the assembly instruction, and Rust provides explicitly wrapping bitshifts.
We do not currently have this in Julia - however, Keno suggested adding explicitly wrapping operations like +% to Julia. Perhaps this could be extended to e.g. >>%?

Slow push! and pop!

One of the benchmarks is significantly slowed down by the fact that Array is implemented in C, which means that much of its internal machinery can’t be inlined into Julia code. Notably, push! and pop! are slow in Julia. Thanks to work by Jameson Nash, Oscar Smith and others, this has been addressed for Julia 1.11, where this benchmark sees an important speedup.

Still more differences?

Unfortunately, even taken together, these points don’t fully explain the performance gap between C and Julia. Even when I address all of them by not timing startup, using StaticArrays, using Julia 1.11, etc., C still runs faster. It also generates smaller assembly code, though I’m not able to tell from reading the assembly where these extra instructions come from. I’d be interested in hearing more.
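(For anyone who wants to dig along: the generated code can be inspected with @code_native, e.g.

julia> @code_native debuginfo=:none dump_module=false solve(puzzle)

where solve and puzzle stand in for whichever benchmark kernel and input are being compared.)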

23 Likes

IIRC from a previous thread, the immutable ones usually go on the stack, and the mutable ones can be moved to the stack if the compiler can prove it doesn’t escape. I don’t remember how to determine this from @code_llvm.

Why is there this discrepancy? What happens if you don’t do that special handling, and are there common situations where you’d want it?

I’ve noticed this in a previous thread comparing elementwise initialization of uninitialized arrays with push!ing to a sizehint!ed empty array. Has that reached parity?

3 Likes

The way bitshifting works in CPUs is that if I do x << y, and x has 32 bits, then only the lower 5 bits of y matter for the bitshift. The other bits in y are ignored. This is because x can only be shifted by at most 31, and 31 is 0b11111. Semantically, this is equivalent to x << mod(y, 8 * sizeof(x)).

In Julia, we support negative bitshifts, where shifting left by -n bits is the same as shifting right by n bits. We also support bitshifting x by 32 bits or more, which zeros out x. So the implementation of << in Julia is, in pseudo-Julia:

if y < 0
    x >> abs(y)                # a negative shift flips direction
elseif y >= 8 * sizeof(x)      # shifting by the full bit width or more zeros x
    zero(x)
else
    native_bitshift(x, y)      # the single CPU shift instruction
end

This is of course slower than just the bitshift, which is a single CPU instruction. I don’t know of any situations where this behaviour is useful. Normally, when people bother with bitshifts instead of mathematical functions like div and mod, it’s because they want to write code close to the metal. That said, I do see how Julia’s definition is semantically more correct than the assembly one. Still, I think we need to have the one with top performance available in Julia.
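For concreteness, here is how those semantics play out at the REPL (Int is 64 bits here):

julia> 1 << 2
4

julia> 1 << -2   # negative shift flips direction: same as 1 >> 2
0

julia> 1 << 64   # shifting by the full bit width or more gives zero
0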

It hasn’t reached parity, and it never will: it’s always faster to allocate an array once and fill it in than to repeatedly push to it, even if pushing were optimally fast. However, pushing is still something like 3x faster in 1.11 compared to 1.10, and this difference also matters for collecting iterators of unknown size.

For example, on Julia 1.10:

julia> @btime collect((i for i in 1:100000 if i < 100000000))
  529.957 μs (12 allocations: 1.83 MiB)

Julia 1.11:

julia> @btime collect((i for i in 1:100000 if i < 100000000))
  181.062 μs (18 allocations: 1.83 MiB)

7 Likes

By adding inlining of external C code, or by moving push! and pop! into Julia code that LLVM compiles directly?

I’m reading now that these are considered undefined behavior in C, but apparently processors often reduce the shift amount modulo the width of the type. I suppose defining the behavior with cheap branches is just a design choice, like how integer division by 0 throws an error instead of being UB? Still, maybe there could be a package for the choices closer to the metal that documents the differences and limitations, including macros that swap out standard operators, like @views swaps out getindex.

UnsafeAssume.jl comes to mind, but really an additional package is unnecessary in this case. Just mask the RHS of the shift:

# Calls the shift operator as-is; Julia's extra branch handling remains.
shift_naive(shift::S, a, b) where {S} = @inline shift(a, b)

# Masks the shift amount to the operand's bit width first, letting the
# compiler prove the branches dead and emit a bare shift instruction.
function shift_masked(shift::S, a::T, b) where {S,T}
  @inline begin
    mask = 8*sizeof(T) - 1
    c = b & mask
    shift(a, c)
  end
end

On an AMD64 (x86-64) machine:

julia> @code_native debuginfo=:none dump_module=false raw=false shift_naive(<<, 1, 1)
        .text
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        xor     ecx, ecx
        cmp     rsi, 64
        mov     edx, 63
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        shlx    rax, rdi, rsi
        cmovb   rcx, rax
        mov     rax, rsi
        neg     rax
        cmp     rax, 63
        cmovb   rdx, rax
        test    rsi, rsi
        sarx    rax, rdi, rdx
        cmovns  rax, rcx
        pop     rbp
        ret
        nop

julia> @code_native debuginfo=:none dump_module=false raw=false shift_masked(<<, 1, 1)
        .text
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        shlx    rax, rdi, rsi
        pop     rbp
        ret
        nop     word ptr cs:[rax + rax]

The above is for <<, but the result is the same for >>>.

So no: given the above, it’s clear no changes to Julia are necessary.
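One caveat worth noting: masking changes the result for out-of-range shift amounts, so it’s only a drop-in replacement when the shift is known to be in range:

julia> shift_masked(<<, 1, 64)   # 64 & 63 == 0, so this shifts by 0
1

julia> 1 << 64                   # Julia's semantics zero the value instead
0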

4 Likes

Had a hunch this might’ve been a thread already. So the benchmark can be optimized with additional masking right off the bat?

3 Likes

This is incorrect btw. Whether they end up on the stack or the heap depends on the program and the optimizer. If an MArray does not escape, it can typically be stack allocated.
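For example (a small sketch; whether the allocation is actually elided depends on inlining and the optimizer):

using StaticArrays

function mvec_sum()
    v = MVector{4,Float64}(undef)   # mutable, but never escapes this function
    for i in 1:4
        v[i] = i
    end
    sum(v)
end

After compilation, @allocated mvec_sum() typically reports 0.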

5 Likes

I would just use your package Bumper.jl. Since reusing the same pointers keeps them in hot cache, performance will not be that different from stack allocation.
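For instance, something along these lines (a sketch from memory of Bumper.jl’s @no_escape/@alloc API; check the package docs for details):

using Bumper
using Random: randn!

function maxima(n, k)
    m = -Inf
    for _ in 1:n
        @no_escape begin
            zs = @alloc(Float64, k)   # bump-allocated; reclaimed when the block exits
            randn!(zs)
            m = max(m, maximum(zs))
        end
    end
    m
end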

I wanted to see what the timings are for my M2 Pro Mac Mini (base config).

hyperfine 'julia nqueen.jl'
Benchmark 1: julia nqueen.jl
  Time (mean ± σ):      2.722 s ±  0.007 s    [User: 3.858 s, System: 0.160 s]
  Range (min … max):    2.711 s …  2.734 s    10 runs

hyperfine 'julia matmul.jl'
Benchmark 1: julia matmul.jl
  Time (mean ± σ):     697.0 ms ±   4.2 ms    [User: 1803.5 ms, System: 219.4 ms]
  Range (min … max):   690.5 ms … 703.4 ms    10 runs

hyperfine 'julia sudoku.jl'
Benchmark 1: julia sudoku.jl
  Time (mean ± σ):      2.162 s ±  0.006 s    [User: 3.292 s, System: 0.169 s]
  Range (min … max):    2.154 s …  2.172 s    10 runs

hyperfine 'julia bedcov.jl'
Benchmark 1: julia bedcov.jl
  Time (mean ± σ):      1.743 s ±  0.018 s    [User: 2.854 s, System: 0.197 s]
  Range (min … max):    1.711 s …  1.764 s    10 runs

Edit: Julia 1.10.0

The reported times for the M1 MacBook Pro in the readme were 3.02, 0.76, 2.35, and 1.96 seconds.

The opposite, actually. The gains mostly came from moving the code from C to Julia. The compiler can’t reason across the Julia-to-C barrier, so having the resizing code in C meant that you always had to perform a function call for what in the average case did no work (since almost all of the time, the resize just adds 1 to the size).
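To illustrate (a toy sketch of the fast/slow path split, not Julia’s actual implementation):

mutable struct Buf{T}
    data::Vector{T}   # backing storage, kept with spare capacity
    len::Int
end
Buf{T}() where {T} = Buf{T}(Vector{T}(undef, 4), 0)

function mypush!(b::Buf{T}, x) where {T}
    if b.len == length(b.data)      # rare slow path: grow geometrically
        grown = Vector{T}(undef, 2 * length(b.data))
        copyto!(grown, 1, b.data, 1, b.len)
        b.data = grown
    end
    b.len += 1                      # common fast path: bump the length and store
    b.data[b.len] = x
    b
end

When that fast path sits behind a C function call, the caller pays the full call overhead even though in the common case nothing happens besides the length bump.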

5 Likes

I think this demonstrates the allocation issue. Ideally the compiler would reuse the space of zs instead of allocating for each iteration.

using BenchmarkTools
using Random: randn!   # randn! lives in the Random stdlib

# Allocates a fresh vector on every iteration
function f(n, k)
    m = 0.0
    for _ in 1:n
        zs = randn(k)
        m = max(m, maximum(zs))
    end
    m
end

# Allocates once and refills the same vector in place
function g(n, k)
    m = 0.0
    zs = randn(k)
    for _ in 1:n
        randn!(zs)
        m = max(m, maximum(zs))
    end
    m
end


@benchmark f(1000000, 1000)
@benchmark g(1000000, 1000)


julia> @benchmark f(1000000, 1000)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.451 s …   4.486 s  ┊ GC (min … max): 6.14% … 6.01%
 Time  (median):     4.469 s              ┊ GC (median):    6.07%
 Time  (mean ± σ):   4.469 s ± 25.014 ms  ┊ GC (mean ± σ):  6.07% ± 0.09%

  █                                                       █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.45 s         Histogram: frequency by time        4.49 s <

 Memory estimate: 7.57 GiB, allocs estimate: 1000000.

julia> @benchmark g(1000000, 1000)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.845 s …  3.853 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.849 s             ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.849 s ± 6.039 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                      █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.84 s        Histogram: frequency by time        3.85 s <

 Memory estimate: 7.94 KiB, allocs estimate: 1.

@jakobnissen How does this come up in the PLB2 benchmark? Shouldn’t we always be able to pass an array in rather than allocating in a loop?

Question, if I may: what does it mean that “an array escapes”? Understanding how to program mutables on the stack would be very interesting…

Basically, a pointer or reference “escaping” means that it’s used in places other than just the function body where its array is defined.
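A minimal sketch of the distinction (illustrative function names):

# buf is only used inside the function body, so it does not escape;
# the compiler is in principle free to stack-allocate or elide it.
function stays_local(n)
    buf = [i^2 for i in 1:n]
    return sum(buf)
end

# Here buf escapes through the return value, so it must live on the heap.
function escapes(n)
    buf = [i^2 for i in 1:n]
    return buf
end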

2 Likes

Thank you. I am going to experiment with MArrays (so far, the concept made no sense to me…)

1 Like

You don’t need StaticArrays.jl for experimenting with that, just use Array (or Vector).

I woke up this morning believing that, for a variable to be allocated on the stack, two criteria must be met:

  1. the variable is immutable (update: or mutable, but it does not escape)
  2. the size and structure (where is what) of the variable must be known at compile time, so that within a method instance, addressing data is done by adding a compile-time constant plus an array-element offset to the stack pointer.

Having shattered 1), you now proceed to batter 2)… and it’s not even near lunch time here! :grinning:

2 Likes

Ah, sorry, it seems I was wrong:

julia> function f(n::T) where {T}
         m = clamp(n, one(T)::T, T(20)::T)
         v = Vector{Float64}(undef, m)

         for i ∈ eachindex(v)
           v[i] = 3*i + 2
         end

         g = x -> x*x
         mapreduce(g, +, v)
       end
f (generic function with 1 method)

julia> @allocated f(10)
144

julia> @allocated f(10)
144

But this should get better as the Julia implementation is improved; this devdoc page seems relevant: EscapeAnalysis · The Julia Language