Chairmarks.jl

Thank you for bringing this up. Getting the right answer is very important. “BenchmarkTools.jl gets the right answer here and Chairmarks does not” is a powerful bug report.

I define the “runtime” of a microbenchmark in terms of “if I swap f for g in a larger program, how will the runtime change?”. We can establish a ground truth for whether f or g is faster by actually embedding them in a larger program.

function macro_benchmark_1(f, n)
    sum = 0
    for i in 1:n
        shift = hash(i)
        for j in 1:n
            sum += f(j, shift)
        end
    end
    sum
end

function macro_benchmark_2(f, n)
    sum = 0
    for i in 1:n
        val = hash(i)
        for j in 1:n
            sum += f(val, j)
        end
    end
    sum
end

function macro_benchmark_3(f, n, m)
    sum = 0
    for i in 1:n
        for j in 1:m
            sum += f(i, j)
        end
    end
    sum
end

function macro_benchmark_4(f, A)
    # https://discourse.julialang.org/t/how-to-shift-bits-faster/19405
    sum = 0
    @inbounds @simd for k = 1:length(A)
        sum += f(k, A[k])
    end
    sum
end

f(x, n) = x << n
g(x, n) = x << (n & 63)

using BenchmarkTools, Chairmarks

# Microbenchmarks

x = UInt128(1); n = 1;
@btime f($x, $n); # 2.500 ns (0 allocations: 0 bytes)
@btime g($x, $n); # 1.958 ns (0 allocations: 0 bytes)

@b f($x, $n) # 1.136 ns
@b g($x, $n) # 1.135 ns

# Macrobenchmarks

@time macro_benchmark_1(f, 100_000); # 0.000104 seconds (1.04 ns per outer iteration)
@time macro_benchmark_1(g, 100_000); # 1.457513 seconds (0.146 ns/iter)

@time macro_benchmark_2(f, 30_000); # 0.762815 seconds (0.848 ns/iter)
@time macro_benchmark_2(g, 30_000); # 0.759171 seconds (0.844 ns/iter)

@time macro_benchmark_3(f, 10_000_000, 100); # 0.339392 seconds (0.339 ns/iter)
@time macro_benchmark_3(g, 10_000_000, 100); # 0.327428 seconds (0.327 ns/iter)

A = rand(0:63, 100_000_000);
@time macro_benchmark_4(f, A); # 0.046893 seconds (0.469 ns/iteration)
@time macro_benchmark_4(g, A); # 0.020474 seconds (0.205 ns/iteration, slightly less than 1 clock cycle)

Julia Version 1.11.0-alpha1
Commit 671de9f5793 (2024-03-01 08:30 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 8 × unknown # Asahi Linux on Mac M2 (3.5 GHz)
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

These macrobenchmarks indicate to me that f and g are too small and the compiler integrates them into their surrounding code too fully for them to be viable candidates for microbenchmarking. I don’t know what the “right answer” is for microbenchmark results.

I think we have some very good reasons to think that f is a slower function than g (namely that f needs to do more work than g, as evidenced by the native code and LLVM IR for the respective functions).
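
A quick way to see that difference for yourself (a sketch; the exact IR depends on your Julia version and CPU):

using InteractiveUtils  # provides @code_llvm (loaded automatically in the REPL)
f(x, n) = x << n
g(x, n) = x << (n & 63)
@code_llvm debuginfo=:none f(UInt128(1), 1)  # extra select/branch logic handling shift amounts >= 128
@code_llvm debuginfo=:none g(UInt128(1), 1)  # masking the shift count lets the compiler drop that logic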

Here are some macrobenchmarks I constructed that I think are more representative than your benchmark, because they eliminate the reduction you’re doing, which can make the differences less apparent:

julia> function macro_benchmark_5!(out, f)
           for j ∈ axes(out, 2)
               x = UInt128(j)
               for i ∈ axes(out, 1)
                   out[i, j] = f(x, i)
               end
           end
       end;

julia> function macro_benchmark_5_noinline!(out, f)
           for j ∈ axes(out, 2)
               x = UInt128(j)
               for i ∈ axes(out, 1)
                   out[i, j] = @noinline f(x, i)
               end
           end
       end;

julia> function macro_benchmark_6!(f, N)
           for j ∈ 1:N
               x = UInt128(j)
               for i ∈ 1:N
                   Base.donotdelete(f(x, i))
               end
           end
       end;

julia> function macro_benchmark_6_noinline!(f, N)
           for j ∈ 1:N
               x = UInt128(j)
               for i ∈ 1:N
                   Base.donotdelete(@noinline f(x, i))
               end
           end
       end;

and here are the results I see:

julia> f(x, n) = x << n;

julia> g(x, n) = x << (n & 63);

julia> let
           out = Matrix{UInt128}(undef, 10_000, 10_000)
           @time macro_benchmark_5!(out, f)
           @time macro_benchmark_5!(out, g)
           println()
           @time macro_benchmark_5_noinline!(out, f)
           @time macro_benchmark_5_noinline!(out, g)
           println()
           @time macro_benchmark_6!(f, 10_000)
           @time macro_benchmark_6!(g, 10_000)
           println()
           @time macro_benchmark_6_noinline!(f, 10_000)
           @time macro_benchmark_6_noinline!(g, 10_000)
       end
  0.196680 seconds
  0.116391 seconds

  0.581224 seconds
  0.497155 seconds

  0.130048 seconds
  0.129944 seconds

  0.281116 seconds
  0.216197 seconds

consistently demonstrating a performance difference between the two.

Here’s my versioninfo():

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × AMD Ryzen 5 5600X 6-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 6 default, 0 interactive, 3 GC (on 12 virtual cores)
Environment:
  JULIA_NUM_THREADS = 6

I’d say the most interesting thing to do with these macrobenchmarks is to compare what BenchmarkTools.jl and Chairmarks.jl report for them. Measuring reported differences between microbenchmarks has never been especially useful in judging their accuracy, but if that difference bubbles up to something realistic, that’s a different story.

The whole reason for using these packages is to do microbenchmarks. Otherwise you would just use @time.


I think it would be nice to get an elaboration on this. I know the authors of BenchmarkTools (Jarrett Revels and Jiahao Chen) wrote a paper about the benchmarking strategy employed in BenchmarkTools ([1608.04295] Robust benchmarking in noisy environments), so clearly they thought about this a lot.

What did they miss that allowed a “hundreds of times” speed-up at the same precision and accuracy?

It would be unfortunate if benchmark packages became a “race to the bottom” where the “fastest” benchmarking package would eventually just be an alias for @elapsed.


Microbenchmarks are the least useful application of a benchmarking tool. Good reasons to use one include reporting mean and median values, detecting performance regressions, and quantifying performance improvements. Running single-digit-nanosecond code a bunch of times can only reduce the inherent noise so much, and ignoring what the compiler can do with code in context is not a useful way to quantify that code’s properties.

I’m not dismissing the microbenchmark finding (nor is @Lilith); I am saying that the useful thing to do with the macrobenchmarks is to run them with the tools in question. It’s important to know whether the discrepancy translates.

I find @Mason’s justification convincing that the “right answer” is that g is a bit faster than f. This makes the issue @matthias314 reported a high priority. I’ll try to fix it. Thank you both.


I just needed to add @noinline (PR). I’ll release the fix once CI passes.

Did you see the reply? https://chairmarks.lilithhafner.com/v1.1.1/explanations#How-is-this-faster-than-BenchmarkTools?

Fixed!

(@v1.11) pkg> up
    Updating registry at `~/.julia/registries/General.toml`
   Installed Chairmarks ─ v1.1.2
    Updating `~/.julia/environments/v1.11/Project.toml`
  [0ca39b1e] ↑ Chairmarks v1.1.1 ⇒ v1.1.2
    Updating `~/.julia/environments/v1.11/Manifest.toml`
  [0ca39b1e] ↑ Chairmarks v1.1.1 ⇒ v1.1.2
Precompiling project...
  3 dependencies successfully precompiled in 2 seconds. 68 already precompiled.

julia> x = UInt128(1); n = 1;

julia> f(x, n) = x << n;

julia> g(x, n) = x << (n & 63);

julia> @b f($x, $n)
[ Info: Loading Chairmarks ...
2.560 ns

julia> @b g($x, $n)
2.275 ns

I agree a lot with this!

And yet, there is a place for lower-precision, faster benchmarks, especially in REPL mode.

Thanks for linking that, interesting. And yet, reading it decreases my trust in BenchmarkTools rather than increasing it. The core model is equation (1) on page 3. To excerpt:

Let P_0 be a deterministic benchmark program which consists of an instruction tape of N instructions […]
Let \tau^{[i]} be the run time of instruction I^{[i]}. Then, the total run time of P_0 can be written T_{P_0} = \sum_{i=1}^{N} \tau^{[i]}.

This model may have been accurate in the late 80s, but today it is so indescribably wrong that it seriously makes me doubt BenchmarkTools.

Look, the CPU executes lots of instructions in parallel, has a giant reorder buffer, and has incredible amounts of microarchitectural state.

Micro-benchmarking is chiefly about controlling the microarchitectural state so that it correctly reflects the real context you want to simulate.

Look at e.g. this discussion on how the long memory of the branch predictor interacts with BenchmarkTools.


This reminds me of nanobench vs Google Benchmark in C++. nanobench advertises itself as being an order of magnitude faster or more, but after some experimenting I found nanobench to have a lot of run-to-run noise, so I stuck with Google Benchmark.

I’m looking forward to trying Chairmarks, though.
Perf gains like not @evaling and hopefully not having BenchmarkTools’ memory leak would be nice wins.

CPUs are likely to reach a “steady state”, but behavior from one execution to another is likely to change.
Note that tools like uiCA often give the percentage of time a particular port is used to actually execute an instruction. There’s a lot of state influencing a modern CPU’s performance.


It seems that Chairmarks can’t yet elicit the best performance from the benchmarked code, and it sometimes loses precision/stability.
Here’s a performance test with BenchmarkTools:

julia> using BenchmarkTools

julia> for i in 1:5
           @btime rand(100, 10000, 100)
       end
  127.274 ms (2 allocations: 762.94 MiB)
  126.542 ms (2 allocations: 762.94 MiB)
  126.427 ms (2 allocations: 762.94 MiB)
  126.196 ms (2 allocations: 762.94 MiB)
  125.712 ms (2 allocations: 762.94 MiB)

And here’s the same test with Chairmarks:

julia> using Chairmarks

julia> Cm=[];

julia> for i in 1:10
           push!(Cm, (@b rand(100, 10000, 100)))
       end

julia> Cm
10-element Vector{Any}:
 154.621 ms (2 allocs: 762.940 MiB, 13.63% gc time)
 156.477 ms (2 allocs: 762.940 MiB, 11.14% gc time)
 156.475 ms (2 allocs: 762.940 MiB, 12.23% gc time)
 152.151 ms (2 allocs: 762.940 MiB, 12.68% gc time)
 152.033 ms (2 allocs: 762.940 MiB, 9.58% gc time)
 156.964 ms (2 allocs: 762.940 MiB, 9.30% gc time)
 155.749 ms (2 allocs: 762.940 MiB, 9.59% gc time)
 156.672 ms (2 allocs: 762.940 MiB, 11.29% gc time)
 150.282 ms (2 allocs: 762.940 MiB, 10.51% gc time)
 155.334 ms (2 allocs: 762.940 MiB, 13.11% gc time)

(it really is that much faster to run)
And here’s a comparison of their precision:

julia> using Statistics
julia> BT = [127.274, 126.542, 126.427, 126.196, 125.712];

julia> var(BT)
#BenchmarkTools (5-elements)
0.32379219999999953

julia> var([1000Cm[i].time for i in eachindex(Cm)])
#Chairmarks (10-elements)
5.550361446666649

julia> var([1000Cm[i].time for i in 1:5])
#Chairmarks (5-elements)
4.8290619929999945

Honestly, I’m very glad to use it for my daily optimizing, but for the high-precision stage it’s not mature enough yet.


Part of the issue seems to be garbage collection. If you run GC.gc() a few times before each benchmark, does the variance decrease?
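
Concretely, the suggestion is something like this sketch (array size copied from the benchmark above):

using Chairmarks
for i in 1:5
    GC.gc(); GC.gc()  # a couple of full collections so earlier garbage isn't billed to the next run
    display(@b rand(100, 10000, 100))
end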


I’m a little surprised that the @btime case isn’t triggering the GC, as you’re supposedly allocating 100x10000x100 arrays repeatedly per benchmark. I’d try refactoring to the push! scheme in case that made a difference, though I’m guessing you’d name the results array BM. But that’s just a small part. When the GC is a major factor, how many times you run the function becomes important, and you need to dig into Chairmarks’ and BenchmarkTools’ options to try to keep the number of samples and number of evaluations the same.
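
On the BenchmarkTools side that looks roughly like this (samples and evals are documented Parameters fields; the corresponding Chairmarks keywords would need the same values for a like-for-like comparison):

using BenchmarkTools
@benchmark rand(100, 10000, 100) samples=10 evals=1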

So now I’m wondering what makes a good benchmark.

Part of it seems to be handled by the tool’s benchmark loop itself. The calls in the benchmarked code can’t be computed at compile time, lifted out of the loop, or otherwise optimized away, and the benchmark artifacts need to be excluded from the timing. I’m wondering how those are handled; the earlier Base.donotdelete and @noinline seem important.
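
To make that concrete (a sketch; naive_sum and guarded_loop are made-up names, not from either package):

function naive_sum(n)
    s = 0
    for i in 1:n
        s += 1  # LLVM can rewrite this loop to a closed form, so timing it measures nothing
    end
    s
end

function guarded_loop(f, x, n)
    for _ in 1:n
        Base.donotdelete(@noinline f(x))  # @noinline keeps the call opaque; donotdelete keeps its result alive
    end
end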

The other part seems to be what we need to handle. The branch predictor thread showed how much performance can vary on inputs of the same size. So do we randomize inputs in untimed setup phases and set the number of evaluations per sample to 1, perhaps with a warning if the runtime is too short for timer precision to be ignored?
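
In BenchmarkTools syntax that combination looks roughly like this (sort! on a fresh random vector is the documented pattern for a per-sample setup; evals=1 means every timed evaluation sees new data):

using BenchmarkTools
@btime sort!(v) setup=(v = rand(1000)) evals=1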

Actually, the performance and precision results look the same as above even with all the unnecessary parts removed:

julia> @b rand(100, 10000, 100)
152.232 ms (2 allocs: 762.940 MiB, 11.98% gc time)

julia> @b rand(100, 10000, 100)
148.313 ms (2 allocs: 762.940 MiB, 11.18% gc time)

julia> @b rand(100, 10000, 100)
143.561 ms (2 allocs: 762.940 MiB, 12.68% gc time)

julia> @b rand(100, 10000, 100)
146.598 ms (2 allocs: 762.940 MiB, 11.88% gc time)

julia> @b rand(100, 10000, 100)
148.343 ms (2 allocs: 762.940 MiB, 13.13% gc time)

I was just trying to show a shorter code sample. Obviously the loop block and the other parts affect the result, but qualitatively they’re about the same. The reason I used an array and push! to save the results is that @b doesn’t print anything when it’s inside a loop body.
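
(An alternative to collecting into an array is to call display explicitly, since only the value of a top-level expression is auto-printed in the REPL. A minimal sketch:)

using Chairmarks
for i in 1:3
    display(@b rand(100, 10000, 100))
end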

BTW, I found something strange:

julia> using Chairmarks

# slower results at first
julia> @b rand(100, 10000, 100)
153.391 ms (2 allocs: 762.940 MiB, 11.27% gc time)

julia> @b rand(100, 10000, 100)
147.263 ms (2 allocs: 762.940 MiB, 13.35% gc time)

# merely loading BenchmarkTools makes no difference
julia> using BenchmarkTools

julia> @b rand(100, 10000, 100)
151.655 ms (2 allocs: 762.940 MiB, 11.75% gc time)

julia> @b rand(100, 10000, 100)
147.708 ms (2 allocs: 762.940 MiB, 11.69% gc time)

# now run @btime once
julia> @btime rand(100, 10000, 100);
  122.444 ms (2 allocations: 762.94 MiB)

julia> @b rand(100, 10000, 100)
123.895 ms (2 allocs: 762.940 MiB)

julia> @b rand(100, 10000, 100)
123.349 ms (2 allocs: 762.940 MiB)

The @b result is quite different depending on whether you have run @btime beforehand.
After running @btime, its result stability also reaches the same level as BenchmarkTools.


For quick-and-dirty REPL work, @btime is fine, and @b may be even better.

For micro-benchmarks, i.e. benchmarking something small and fast:

I am no fan of frameworks that try to control the benchmark loop for tiny snippets.

Instead, I prefer to get a rough timing via @btime, and then write my own benchmark loop (e.g. “do this operation N=10k times”), taking care to massage the optimization context and microarchitectural state to resemble what I imagine the true context to be like. Then I check for a handful of N whether the time is linear enough. Theoretically @time could suffice then (normalized by execution count N), but the histograms + warmup + etc are sure nice.
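
A minimal sketch of the kind of loop I mean (workload, xs, and N are placeholders):

function my_bench(workload, xs, N)
    t = time_ns()
    for _ in 1:N
        for x in xs
            Base.donotdelete(workload(x))  # keep results alive so the loop isn't deleted
        end
    end
    (time_ns() - t) / (N * length(xs))  # ns per call; check this stays roughly constant as N doubles
end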

Important microarchitectural things to think about:

  1. Is the branch-predictor cheating?
  2. You’re measuring with hot cache. Make your data structures bigger if you want to include cache-misses in your measurement (you will see timings drop off a cliff once you get L1 misses, more cliffs for L2, L3, TLB).
  3. Make sure that you’re not merely measuring memory bandwidth. Seriously, you MUST know your machine’s memory bandwidth by heart (@btime A .= B for a 1 GB array; see the sketch after this list).
  4. Do you want to measure throughput or latency? Depends on the intended context your function will be used in! Choose appropriately.
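
For point 3, the baseline measurement looks like this (a sketch; buffer sizes are illustrative):

using BenchmarkTools
A = zeros(UInt8, 2^30); B = zeros(UInt8, 2^30)  # two 1 GiB buffers
@btime $A .= $B  # bytes read + written, divided by the reported time, approximates sustained copy bandwidth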

Indeed, the concept of “this instruction / function takes N cycles” makes no sense for N < 30: timings, orderings of instructions, and the very concept of “which instruction is executing” have a kind of uncertainty principle, due to pipelining, reordering, and superscalar execution. Sure, the CPU will tell you an instruction pointer and register states if you interrupt it – but this is a careful lie that is reconstructed after the fact. The reorder buffer on the Apple M1 is famously >600 deep – so your uncertainty principle extends over 600 instructions!

Processor manuals will give you at least two numbers for each instruction: latency and inverse throughput. So some instruction may have a latency of 2 cycles and an inverse throughput of 0.5 cycles. This means that the instruction takes 2 cycles to complete, and 4 of them can run in parallel. If each of your instructions takes the previous output as input, then the correct timing is 2 cycles; if it doesn’t, then the correct timing is 0.5 cycles.
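
A self-contained sketch of that distinction (the function names are made up; the update is just an operation the compiler cannot collapse):

function dependent_chain(x, n)
    acc = x
    for _ in 1:n
        acc = acc * 3 + 1  # each step waits for the previous result: latency-bound
    end
    acc
end

function independent_chains(x, n)
    a, b, c, d = x, x + 1, x + 2, x + 3
    for _ in 1:n
        a = a * 3 + 1  # four independent chains can overlap in the pipeline:
        b = b * 3 + 1  # roughly throughput-bound
        c = c * 3 + 1
        d = d * 3 + 1
    end
    a + b + c + d
end

On a typical out-of-order machine the second loop does four times the work in not much more wall time, and neither number by itself tells you which one your real code will see.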

Benchmark frameworks that control the inner loop hide this important distinction. The very fact that they proudly report 4.5ns timings is a contradiction in itself – this is a fundamentally meaningless number (< ROB size).

Ideal is of course to benchmark a real application that has an appropriate scaling parameter N by which you can normalize.


Honestly, I’m very glad to use it for my daily optimizing, but for the high-precision stage it’s not mature enough yet.

Lovely! When you get to the high-precision stages, I recommend setting seconds=1 or seconds=5 to communicate to Chairmarks your desire for increased precision and willingness to wait a bit longer for results. Feel free to chime in here as well, to make sure that proposed feature aligns with your needs.

julia> f(t) = display(@b rand(100, 10000, 100) seconds=t)
f (generic function with 1 method)

julia> @time for _ in 1:5 f(.1) end
91.077 ms (3 allocs: 762.940 MiB, 4.34% gc time)
90.072 ms (3 allocs: 762.940 MiB, 4.11% gc time)
90.770 ms (3 allocs: 762.940 MiB, 4.26% gc time)
91.512 ms (3 allocs: 762.940 MiB, 4.26% gc time)
90.910 ms (3 allocs: 762.940 MiB, 4.28% gc time)
  1.437093 seconds (2.04 k allocations: 11.176 GiB, 4.68% gc time, 2.33% compilation time)

julia> @time for _ in 1:5 f(1) end
89.918 ms (3 allocs: 762.940 MiB, 3.93% gc time)
90.421 ms (3 allocs: 762.940 MiB, 4.10% gc time)
90.148 ms (3 allocs: 762.940 MiB, 4.03% gc time)
90.376 ms (3 allocs: 762.940 MiB, 4.03% gc time)
90.000 ms (3 allocs: 762.940 MiB, 4.05% gc time)
  5.633132 seconds (709 allocations: 45.449 GiB, 4.39% gc time, 0.04% compilation time)

julia> @time for _ in 1:5 f(5) end
89.737 ms (3 allocs: 762.940 MiB, 3.94% gc time)
90.512 ms (3 allocs: 762.940 MiB, 3.97% gc time)
90.606 ms (3 allocs: 762.940 MiB, 3.79% gc time)
90.571 ms (3 allocs: 762.940 MiB, 3.72% gc time)
90.134 ms (3 allocs: 762.940 MiB, 3.70% gc time)
 25.680946 seconds (1.72 k allocations: 207.871 GiB, 4.14% gc time)

julia> g(t) = @btime rand(100, 10000, 100) seconds=t
g (generic function with 1 method)

julia> @time for _ in 1:5 g(.1) end
  88.445 ms (3 allocations: 762.94 MiB)
  89.041 ms (3 allocations: 762.94 MiB)
  88.616 ms (3 allocations: 762.94 MiB)
  87.718 ms (3 allocations: 762.94 MiB)
  94.014 ms (3 allocations: 762.94 MiB)
  4.658629 seconds (118.12 k allocations: 22.357 GiB, 39.70% gc time, 1.18% compilation time)

julia> @time for _ in 1:5 g(1) end
  91.073 ms (3 allocations: 762.94 MiB)
  90.805 ms (3 allocations: 762.94 MiB)
  89.535 ms (3 allocations: 762.94 MiB)
  91.025 ms (3 allocations: 762.94 MiB)
  89.329 ms (3 allocations: 762.94 MiB)
 13.233828 seconds (118.92 k allocations: 81.217 GiB, 22.03% gc time, 1.56% compilation time)

julia> @time for _ in 1:5 g(5) end
  89.208 ms (3 allocations: 762.94 MiB)
  91.972 ms (3 allocations: 762.94 MiB)
  91.540 ms (3 allocations: 762.94 MiB)
  91.021 ms (3 allocations: 762.94 MiB)
  91.626 ms (3 allocations: 762.94 MiB)
 55.815202 seconds (122.62 k allocations: 373.280 GiB, 15.00% gc time, 1.03% compilation time)

I’m curious how many of these changes are possible to port to BenchmarkTools.

  • If I understand correctly, the way Chairmarks avoids a memory leak is by defining the benchmark function locally. Could we use the same design in BenchmarkTools to avoid memory leaks but maintain the other semantics of BenchmarkTools, e.g. @btime local=true?
  • Could we add a flag to BenchmarkTools to skip GC if the user doesn’t want it?

Because BenchmarkTools has become a standard package for benchmarking, it would be nice if those who are pointed there didn’t encounter memory leaks by default.


This is already supported; see Manual · BenchmarkTools.jl (a usage sketch follows the list):

  • gctrial: If true, run gc() before executing this benchmark’s trial. Defaults to BenchmarkTools.DEFAULT_PARAMETERS.gctrial = true.
  • gcsample: If true, run gc() before each sample. Defaults to BenchmarkTools.DEFAULT_PARAMETERS.gcsample = false.
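
A usage sketch (flipping the global default; per-benchmark keyword forms work like the other Parameters fields):

using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.gcsample = true  # run gc() before every sample
@benchmark rand(100, 10000, 100)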