Difference in microbenchmark result, Chairmarks.jl vs BenchmarkTools.jl

Hi all. I am using BenchmarkTools.jl and Chairmarks.jl to compare the performance of the following (hopefully) mathematically equivalent expressions, to see whether they are also computationally equivalent. The results are as follows:

julia> using Chairmarks, BenchmarkTools

julia> @btime Δp/(abs(cis(deg2rad(Δa))-1)) setup = (Δp=1; Δa=10);
  0.977 ns (0 allocations: 0 bytes)

julia> @btime sqrt(Δp^2 / (2 - 2cosd(Δa))) setup = (Δp=1; Δa=10);
  0.978 ns (0 allocations: 0 bytes)

julia> @btime Δp/2sind(Δa/2) setup = (Δp=1; Δa=10);
  0.978 ns (0 allocations: 0 bytes)

julia> @b (Δp=1, Δa=10) _.Δp/(abs(cis(deg2rad(_.Δa))-1))
15.411 ns

julia> @b (Δp=1, Δa=10) sqrt(_.Δp^2 / (2 - 2cosd(_.Δa)))
17.973 ns

julia> @b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2)
17.439 ns

As you can see, there are two problems here. First, the two tools report minimum timings that differ by roughly 16x. From what I understand this is not uncommon for a microbenchmark, and it depends a lot on the benchmarking methodology.

The larger issue is that @btime reports them as computationally equivalent, whereas @b reports the second as 16.6% slower and the third as 13.2% slower than the first. In such a situation, which tool should one trust? And how can the settings be tweaked to increase confidence in the benchmarks?
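
For context, the most obvious setting to tweak seems to be the time budget each macro is given, for example along these lines (the 5-second budget is arbitrary):

@btime Δp/2sind(Δa/2) setup = (Δp=1; Δa=10) seconds=5
@b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2) seconds=5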


I have no clue what’s causing this, but wanted to try it out; BenchmarkTools.jl reported the same time all three times (1.058 ns for me), but Chairmarks.jl did this:

julia> @b (Δp=1, Δa=10) _.Δp/(abs(cis(deg2rad(_.Δa))-1))
35.375 ns

julia> @b (Δp=1, Δa=10) sqrt(_.Δp^2 / (2 - 2cosd(_.Δa)))
21.053 ns

julia> @b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2)
20.653 ns

So, completely opposite to your results, with much larger time differences…
Edit: added versioninfo() output.

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × 13th Gen Intel(R) Core(TM) i7-1365U
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)
Environment:
  JULIA_EDITOR = gedit

Same here

julia> using Chairmarks, BenchmarkTools

julia> @btime Δp/(abs(cis(deg2rad(Δa))-1)) setup = (Δp=1; Δa=10);
  1.389 ns (0 allocations: 0 bytes)

julia> @btime sqrt(Δp^2 / (2 - 2cosd(Δa))) setup = (Δp=1; Δa=10);
  1.411 ns (0 allocations: 0 bytes)

julia> @btime Δp/2sind(Δa/2) setup = (Δp=1; Δa=10);
  1.411 ns (0 allocations: 0 bytes)

julia> @b (Δp=1, Δa=10) _.Δp/(abs(cis(deg2rad(_.Δa))-1))
34.342 ns

julia> @b (Δp=1, Δa=10) sqrt(_.Δp^2 / (2 - 2cosd(_.Δa)))
22.703 ns

julia> @b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2)
22.438 ns

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

To me this looks like BenchmarkTools.jl completely constant-folded the computation; there is essentially no other way to achieve sub-nanosecond timings.
Try wrapping in Refs and then the results should become comparable:

julia> @btime Δp/(abs(cis(deg2rad(Δa))-1)) setup = (Δp=1; Δa=10);
  0.977 ns (0 allocations: 0 bytes)

julia> @btime sqrt(Δp^2 / (2 - 2cosd(Δa))) setup = (Δp=1; Δa=10);
  0.978 ns (0 allocations: 0 bytes)

julia> @btime Δp/2sind(Δa/2) setup = (Δp=1; Δa=10);
  0.978 ns (0 allocations: 0 bytes)

julia> @b (Δp=1, Δa=10) _.Δp/(abs(cis(deg2rad(_.Δa))-1))
17.121 ns

julia> @b (Δp=1, Δa=10) sqrt(_.Δp^2 / (2 - 2cosd(_.Δa)))
18.959 ns

julia> @b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2)
19.761 ns

julia> @btime Δp[]/(abs(cis(deg2rad(Δa[]))-1)) setup = (Δp=Ref(1); Δa=Ref(10));
  17.075 ns (0 allocations: 0 bytes)

julia> @btime sqrt(Δp[]^2 / (2 - 2cosd(Δa[]))) setup = (Δp=Ref(1); Δa=Ref(10));
  18.984 ns (0 allocations: 0 bytes)

julia> @btime Δp[]/2sind(Δa[]/2) setup = (Δp=Ref(1); Δa=Ref(10));
  12.304 ns (0 allocations: 0 bytes)
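
For completeness, an equivalent way to defeat the constant folding with BenchmarkTools is the interpolation idiom from its manual, splicing a Ref directly into the expression instead of going through setup. A sketch (output omitted):

@btime $(Ref(1))[]/(abs(cis(deg2rad($(Ref(10))[]))-1));
@btime $(Ref(1))[]/2sind($(Ref(10))[]/2);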

Seems like constant folding explains the sub-ns results for BenchmarkTools. However, the timing results vary a lot:

| Username | @b complex | @b squared | @b sin | @btime complex | @btime squared | @btime sin |
|---|---|---|---|---|---|---|
| Alseidon | 35.375 ns | 21.053 ns | 20.653 ns | | | |
| GDalle | 34.342 ns | 22.703 ns | 22.438 ns | | | |
| abraemer | 17.121 ns | 18.959 ns | 19.761 ns | 17.075 ns | 18.984 ns | 12.304 ns |
| KronosTheLate | 15.896 ns | 17.996 ns | 17.658 ns | 15.466 ns | 17.934 ns | 9.997 ns |

@b reports that complex is the slowest for Alseidon and GDalle, while it is the fastest for me and abraemer. This could readily be explained by differences in CPU architecture (or Julia version - I am on 1.11).

But on the same computers, @b and @btime (with Ref) rank the expressions differently for me and abraemer, which is concerning. According to @b, the complex version is a little faster, while @btime reports that sin is the fastest by far.

It seems like different things are being benchmarked, and they give opposing results. Tagging @Lilith who might be able to understand what is up with that.
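
One way to check whether the two macros are really timing the same work might be to hand both literally the same Ref-wrapped expression, so neither can fold anything away; a sketch, assuming Chairmarks is happy with a NamedTuple of Refs as the setup value (output omitted):

@b (Δp=Ref(1), Δa=Ref(10)) _.Δp[]/2sind(_.Δa[]/2)
@btime Δp[]/2sind(Δa[]/2) setup = (Δp=Ref(1); Δa=Ref(10));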


CPU architecture is likely the reason for the differences between the benchmarks. Alseidon and GDalle have Intel CPUs and I am on AMD. I don’t know which Julia version I ran the previous benchmarks on, so I reran them on 1.10.2 for completeness and got similar results to before:

julia> using Chairmarks, BenchmarkTools
julia> @b (Δp=1, Δa=10) _.Δp/(abs(cis(deg2rad(_.Δa))-1))
17.091 ns
julia> @b (Δp=1, Δa=10) sqrt(_.Δp^2 / (2 - 2cosd(_.Δa)))
18.813 ns
julia> @b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2)
19.859 ns
julia> @btime Δp[]/(abs(cis(deg2rad(Δa[]))-1)) setup = (Δp=Ref(1); Δa=Ref(10));
  16.797 ns (0 allocations: 0 bytes)
julia> @btime sqrt(Δp[]^2 / (2 - 2cosd(Δa[]))) setup = (Δp=Ref(1); Δa=Ref(10));
  18.986 ns (0 allocations: 0 bytes)
julia> @btime Δp[]/2sind(Δa[]/2) setup = (Δp=Ref(1); Δa=Ref(10));
  12.305 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

The first case now seems to be a bit faster with BenchmarkTools.jl. But rerunning them all a couple of times gives up to ±0.2 ns of fluctuation in the timings. So for me the difference between Chairmarks.jl and BenchmarkTools.jl is probably negligible for the “complex” and “squared” cases, but real for the “sin” case.

The function _.Δp/2sind(_.Δa/2) has the strange property that, on some hardware, it gets much faster after about 10 million evaluations in rapid sequence.

julia> data = @be (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2) seconds=1
Benchmark: 34474 samples with 415 evaluations
min    44.790 ns
median 70.937 ns
mean   64.431 ns
max    3.556 μs

julia> using UnicodePlots

julia> scatterplot([log(s.time) for s in data.samples])
       ┌────────────────────────────────────────┐ 
   -12 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠀⠀⠀⠀⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⣰⣄⣀⣄⣄⣤⣀⣄⣠⣀⣀⣅⣀⣄⣀⣄⣀⣄⣀⣠⣦⣠⣄⣠⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
       │⠒⠓⠒⠒⠒⠒⠒⠒⠚⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠂⡀⠀⠀⡀⠀⠀⢀⠀⡂⠀⠀⠀⠀⠀│ 
   -17 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣷⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀│ 
       └────────────────────────────────────────┘ 
       ⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀40 000⠀ 

julia> findfirst(x -> x.time < 50e-9, data.samples)*415
10474600

julia> @b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2)
70.920 ns

julia> @b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2) seconds=1
44.786 ns

julia> @btime Δp[]/2sind(Δa[]/2) setup = (Δp=Ref(1); Δa=Ref(10));
  43.963 ns (0 allocations: 0 bytes)


julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, nehalem)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

You can also get faster runtime by running GC.gc() 4 times before benchmarking. I have no idea why. Perhaps Diogo Netto knows?

julia> @b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2)
70.918 ns

julia> GC.gc();GC.gc();GC.gc();GC.gc(); @b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2)
44.758 ns

It’s unclear to me whether the slow runtime or the fast runtime is “right”.


Oh wow, that is really strange and interesting! I feel like this sort of breaks what I thought was the fundamental justification for reporting the minimal runtime in benchmarking, namely that “the minimal runtime is the least noisy, and therefore the most representative”.

It does indeed appear that in specific situations, such as running more than ~10 million evaluations in rapid sequence or running GC 4 times just before, there is something akin to negative noise in the runtime. I call it noise because I find these situations unrealistic, or at best very rare, in actual code.

It would actually appear to me that it is a mistake for @btime to do as much as it does (triggering GC and running very many evaluations), because of the possibility of such negative noise.

Perhaps it would be good to report the 5% quantile instead, i.e. the fastest runtime after excluding the fastest 5% of samples, to protect against such “negative noise”? Or, I guess the careful user actually has to look at the distribution (and possibly the time evolution) of the samples to get the full story in every case.
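
For what it’s worth, that quantile is easy to compute from the samples @be already collects; a sketch using the data object from the run above and the Statistics standard library:

using Statistics
times = [s.time for s in data.samples]  # per-sample times in seconds, same values as plotted above
quantile(times, 0.05)                   # "fastest runtime excluding the fastest 5%"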

Thanks for making that investigation for me, it was quite illustrative! Turns out that benchmarking is harder than I thought.

I ran your example just out of curiosity, and for me it looks a bit different: there is no such pattern visible; the function simply runs at a different overall speed.

julia> data = @be (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2) seconds=1
Benchmark: 30889 samples with 1436 evaluations
min    19.600 ns
median 20.623 ns
mean   20.790 ns
max    45.038 ns

julia> scatterplot([log(s.time) for s in data.samples])
         ┌────────────────────────────────────────┐ 
   -16.9 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⢀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠁⠀⠀⠀⠀⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠂⠀⠀⠂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠄⠀⠈⢀⠀⡀⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠈⠀⠀⠀⡀⠂⠀⠄⠀⢐⠀⠀⡄⠀⠀⠀⠀⠀⠀⠁⠀⠠⠀⠂⢀⠀⡂⠢⠀⠀⢀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠄⢉⡒⡆⣂⠨⠜⠨⠔⠄⠞⢪⣂⢑⠂⢣⡤⢆⣤⡐⡠⣐⢃⠂⣜⠎⡊⢠⢆⢡⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⢒⡐⠓⠐⠈⠋⠏⠧⡌⠇⠊⠉⠘⠓⠴⠉⡴⠳⣑⠡⠍⠢⠙⢱⠂⠥⠐⠓⠸⠡⠓⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠁⠀⠄⠀⠀⠂⠀⠀⠐⠀⠀⠀⠀⠀⢰⡀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⢶⣳⣾⣷⣶⣷⣶⣶⣶⣶⡶⣶⡶⣶⡶⣶⢖⡷⣶⠶⢶⣶⣶⡲⡶⣼⠾⢾⣴⢦⡴⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⣟⣿⣽⣿⣷⣗⣿⣾⣿⢿⣗⣾⣞⢾⣿⣳⣶⣿⣿⣿⣻⣷⣶⣷⣾⣾⣾⣷⣿⣟⣽⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⣿⣿⣿⣿⣿⣿⢿⣿⣿⣿⣿⢿⣿⣿⣿⣿⣿⣿⣿⣿⡿⣿⣿⡿⣿⡿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
   -17.8 │⠀⠀⠀⠀⠀⠀⠀⠀⠁⠀⠀⠀⠈⠀⠀⠈⠀⠀⠀⠁⠀⠈⠀⠀⠀⠀⠀⠀⠁⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         └────────────────────────────────────────┘ 
         ⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀40 000⠀ 
julia> data2 = @benchmark Δp[]/2sind(Δa[]/2) setup = (Δp=Ref(1); Δa=Ref(10))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  12.025 ns … 29.853 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.493 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.562 ns ±  0.647 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                         ▄▄ ▄█    ▃                            
  ▂▁▁▂▁▂▂▂▁▂▂▂▁▂▂▂▁▃▃▂▃▁▅██▁██▃▁▆██▁▇▅▄▁▂▂▂▂▁▂▂▂▁▂▂▂▁▂▂▂▁▂▂▂▂ ▃
  12 ns           Histogram: frequency by time        15.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> scatterplot(log.(data2.times*1e-9))
         ┌────────────────────────────────────────┐ 
   -17.3 │⠂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠀⡀⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠐⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠀⠀⠂⠀⠀⠈⠀⠠⠀⠀⠠⠀⠂⠂⠂⠀⠂⠀⡀⠀⡀⠀⠈⠐⠀⠀⠀⠀⢀⠀⠐⠀⠀⠀⠀⠂⠄│ 
         │⠀⠁⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠁⠀⠄⠀⠁⡀⠀⠀⠀⠀⠀⠀⠠⠀⠐⠁⠁⠁⠀⠀│ 
         │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         │⠚⡀⢀⠂⠖⠀⠠⡠⡰⠦⣔⠀⣔⠊⡢⣪⡀⠐⠰⡎⢁⡈⢖⠀⠀⠐⢴⡐⠀⠐⠲⡲⠰⠹⣔⢢⢢⠞⡒⠂│ 
         │⣆⣠⣠⣔⣾⣢⣀⣠⣈⣲⣘⣀⣐⣰⣑⣎⣆⣄⣌⣐⣐⣴⣂⣀⣀⣮⣐⣀⣀⣼⣀⣐⣠⣶⣐⣂⣀⣖⣃⣆│ 
         │⣿⣿⣿⣿⠿⣿⣿⣿⡿⢿⣿⣿⣿⡿⣿⣿⡿⠿⣿⣿⣿⣿⣿⢿⠿⣿⣿⠿⣿⣿⡿⣿⡿⣿⡿⣿⡿⣿⡿⣿│ 
         │⠁⠚⠀⠨⠈⠿⠇⠀⠛⢉⠅⠑⠈⠛⠰⠿⠃⠑⠍⠇⠅⠯⠍⠀⠀⣫⠈⠐⠑⠽⠑⠀⠃⠪⠄⠸⠠⠭⠁⠃│ 
   -18.3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
         └────────────────────────────────────────┘ 
         ⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10 000⠀ 

So perhaps this pattern is a feature of Intel CPUs?
