Hi all. I am using BenchmarkTools.jl and Chairmarks.jl to compare the performance of the following (hopefully) mathematically equivalent expressions, to see whether they are also computationally equivalent. The results are as follows:
As you can see, there are two problems here. First, the two tools report minimum timings that differ by about 16x. However, for a microbenchmark this is not uncommon from what I understand, and it depends a lot on the benchmarking methodology.
The larger issue is that @btime reports that they are computationally equivalent, whereas @b reports the second as 16.6% slower and the third as 13.5% slower than the first. In such a situation, which tool should one trust? And how can the settings be tweaked to increase the confidence in the benchmarks?
I have no clue what’s causing this, but wanted to try it out; BenchmarkTools.jl reported the same time all three times (1.058 ns for me), but Chairmarks.jl did this:
To me this looks like BenchmarkTools.jl completely constant-folded the computation. There is essentially no other way to achieve sub-nanosecond timings.
Try wrapping in Refs and then the results should become comparable:
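A sketch of what that looks like, using the `Δp`/`Δa` expression that comes up later in this thread as the example (the same pattern applies to the other expressions):

```julia
using BenchmarkTools, Chairmarks

# Without Refs everything is a compile-time constant, so the compiler can
# constant-fold the whole computation and @btime reports ~1 ns or less:
@btime 1 / 2sind(10 / 2)

# BenchmarkTools: pass the inputs as Refs via `setup` and unwrap with [],
# which keeps them out of constant folding:
@btime Δp[] / 2sind(Δa[] / 2) setup = (Δp = Ref(1); Δa = Ref(10))

# Chairmarks: the first argument is generated at benchmark time and passed
# to the expression as `_`, which has the same effect:
@b (Δp = 1, Δa = 10) _.Δp / 2sind(_.Δa / 2)
```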
It seems like constant folding explains the sub-ns results for BenchmarkTools. However, the timing results vary a lot:
| Username | @b complex | @b squared | @b sin | @btime complex | @btime squared | @btime sin |
|---|---|---|---|---|---|---|
| Alseidon | 35.375 ns | 21.053 ns | 20.653 ns | | | |
| GDalle | 34.342 ns | 22.703 ns | 22.438 ns | | | |
| abraemer | 17.121 ns | 18.959 ns | 19.761 ns | 17.075 ns | 18.984 ns | 12.304 ns |
| KronosTheLate | 15.896 ns | 17.996 ns | 17.658 ns | 15.466 ns | 17.934 ns | 9.997 ns |
@b reports that complex is the slowest for Alseidon and GDalle, while it is the fastest according to @b for me and abraemer. This seems to be readily explained by different CPU architectures (or Julia versions; I am on 1.11).
But on the same computers, @b and @btime (with Ref) flip the ordering for me and abraemer, which is concerning: with @b, the complex version is a little faster, while @btime reports that the sin version is the fastest by far.
It seems like different things are being benchmarked, and they give opposing results. Tagging @Lilith who might be able to understand what is up with that.
CPU architecture is likely the reason for the differences within the benchmarks. Alseidon and GDalle have Intel CPUs and I am on AMD. I don’t know which Julia version I ran the benchmarks on earlier, so I reran them on 1.10.2 for completeness. I got similar results to before:
The first case now seems to be a bit faster for BenchmarkTools.jl. But rerunning them all a couple of times gives up to ±0.2 ns of timing fluctuation in the benchmarks. So the difference between Chairmarks.jl and BenchmarkTools.jl is probably negligible for the “complex” and “squared” cases, but real for the “sin” case for me.
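For reference, checking that fluctuation can be done by rerunning the same benchmark a few times and comparing the reported minima. A sketch, using the expression that appears later in the thread as a stand-in:

```julia
using Chairmarks

# Rerun the same benchmark five times and compare the reported minimum
# times (in seconds); the spread observed here is up to about ±0.2 ns.
mins = [(@b (Δp = 1, Δa = 10) _.Δp / 2sind(_.Δa / 2)).time for trial in 1:5]
extrema(mins)
```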
The function _.Δp/2sind(_.Δa/2) has the strange property that, on some hardware, it gets much faster after about 10 million evaluations in rapid sequence.
julia> data = @be (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2) seconds=1
Benchmark: 34474 samples with 415 evaluations
min 44.790 ns
median 70.937 ns
mean 64.431 ns
max 3.556 μs
julia> using UnicodePlots
julia> scatterplot([log(s.time) for s in data.samples])
┌────────────────────────────────────────┐
-12 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⣰⣄⣀⣄⣄⣤⣀⣄⣠⣀⣀⣅⣀⣄⣀⣄⣀⣄⣀⣠⣦⣠⣄⣠⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠒⠓⠒⠒⠒⠒⠒⠒⠚⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠂⡀⠀⠀⡀⠀⠀⢀⠀⡂⠀⠀⠀⠀⠀│
-17 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣷⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀│
└────────────────────────────────────────┘
⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀40 000⠀
julia> findfirst(x -> x.time < 50e-9, data.samples)*415
10474600
julia> @b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2)
70.920 ns
julia> @b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2) seconds=1
44.786 ns
julia> @btime Δp[]/2sind(Δa[]/2) setup = (Δp=Ref(1); Δa=Ref(10));
43.963 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, nehalem)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
You can also get faster runtime by running GC.gc() 4 times before benchmarking. I have no idea why. Perhaps Diogo Netto knows?
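For reference, the sequence being described is roughly this (a sketch; the count of four GC calls is just what was observed to help):

```julia
using Chairmarks

# Force a few garbage-collection passes immediately before benchmarking;
# on the machine above this lowers the reported minimum time.
for _ in 1:4
    GC.gc()
end
@b (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2) seconds=1
```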
Oh wow, that is really strange and interesting! I feel like this sort of breaks what I thought was fundamental to reporting the minimal runtime in benchmarking, which I understood as “the minimal runtime is the least noisy, and therefore the most representative”.
It does indeed appear that in specific situations, like running more than 10 million evaluations in rapid succession or running GC 4 times just before, there is something akin to negative noise in the runtime. I call it noise because I find these situations unrealistic, or at best very rare, in actual code.
It would actually appear to me that it is a mistake for @btime to do as much as it does (triggering GC and running so many evaluations), because of the possibility of such negative noise.
Perhaps it would be good to report the lower 5% quantile instead, i.e. the fastest runtime after excluding the fastest 5% of samples, to protect against such “negative noise”? Or, I guess, the literate user actually has to look at the distribution (and possibly the time evolution) of the samples to get the full story in every case.
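For example, with the samples that `@be` returns, that robust lower bound could be computed directly (a sketch; `quantile` comes from the Statistics standard library):

```julia
using Chairmarks, Statistics

data = @be (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2) seconds=1

# 5th percentile of the per-sample times (in seconds): the fastest runtime
# left after discarding the fastest 5% of samples, as a guard against
# occasional "too fast" outliers.
times = [s.time for s in data.samples]
quantile(times, 0.05)
```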
Thanks for doing that investigation for me, it was quite illustrative! It turns out that benchmarking is harder than I thought.
I ran your example just out of curiosity, and for me it looks a bit different: there is no such pattern visible; the function just seems to run at a different speed overall.
julia> data = @be (Δp=1, Δa=10) _.Δp/2sind(_.Δa/2) seconds=1
Benchmark: 30889 samples with 1436 evaluations
min 19.600 ns
median 20.623 ns
mean 20.790 ns
max 45.038 ns
julia> scatterplot([log(s.time) for s in data.samples])
┌────────────────────────────────────────┐
-16.9 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⢀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠁⠀⠀⠀⠀⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠂⠀⠀⠂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠄⠀⠈⢀⠀⡀⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠈⠀⠀⠀⡀⠂⠀⠄⠀⢐⠀⠀⡄⠀⠀⠀⠀⠀⠀⠁⠀⠠⠀⠂⢀⠀⡂⠢⠀⠀⢀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠄⢉⡒⡆⣂⠨⠜⠨⠔⠄⠞⢪⣂⢑⠂⢣⡤⢆⣤⡐⡠⣐⢃⠂⣜⠎⡊⢠⢆⢡⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⢒⡐⠓⠐⠈⠋⠏⠧⡌⠇⠊⠉⠘⠓⠴⠉⡴⠳⣑⠡⠍⠢⠙⢱⠂⠥⠐⠓⠸⠡⠓⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠁⠀⠄⠀⠀⠂⠀⠀⠐⠀⠀⠀⠀⠀⢰⡀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⢶⣳⣾⣷⣶⣷⣶⣶⣶⣶⡶⣶⡶⣶⡶⣶⢖⡷⣶⠶⢶⣶⣶⡲⡶⣼⠾⢾⣴⢦⡴⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⣟⣿⣽⣿⣷⣗⣿⣾⣿⢿⣗⣾⣞⢾⣿⣳⣶⣿⣿⣿⣻⣷⣶⣷⣾⣾⣾⣷⣿⣟⣽⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⣿⣿⣿⣿⣿⣿⢿⣿⣿⣿⣿⢿⣿⣿⣿⣿⣿⣿⣿⣿⡿⣿⣿⡿⣿⡿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀│
-17.8 │⠀⠀⠀⠀⠀⠀⠀⠀⠁⠀⠀⠀⠈⠀⠀⠈⠀⠀⠀⠁⠀⠈⠀⠀⠀⠀⠀⠀⠁⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀│
└────────────────────────────────────────┘
⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀40 000⠀
julia> data2 = @benchmark Δp[]/2sind(Δa[]/2) setup = (Δp=Ref(1); Δa=Ref(10))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 12.025 ns … 29.853 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.493 ns ┊ GC (median): 0.00%
Time (mean ± σ): 13.562 ns ± 0.647 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▄ ▄█ ▃
▂▁▁▂▁▂▂▂▁▂▂▂▁▂▂▂▁▃▃▂▃▁▅██▁██▃▁▆██▁▇▅▄▁▂▂▂▂▁▂▂▂▁▂▂▂▁▂▂▂▁▂▂▂▂ ▃
12 ns Histogram: frequency by time 15.2 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> scatterplot(log.(data2.times*1e-9))
┌────────────────────────────────────────┐
-17.3 │⠂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⡀⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠐⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠂⠀⠀⠈⠀⠠⠀⠀⠠⠀⠂⠂⠂⠀⠂⠀⡀⠀⡀⠀⠈⠐⠀⠀⠀⠀⢀⠀⠐⠀⠀⠀⠀⠂⠄│
│⠀⠁⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠁⠀⠄⠀⠁⡀⠀⠀⠀⠀⠀⠀⠠⠀⠐⠁⠁⠁⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠚⡀⢀⠂⠖⠀⠠⡠⡰⠦⣔⠀⣔⠊⡢⣪⡀⠐⠰⡎⢁⡈⢖⠀⠀⠐⢴⡐⠀⠐⠲⡲⠰⠹⣔⢢⢢⠞⡒⠂│
│⣆⣠⣠⣔⣾⣢⣀⣠⣈⣲⣘⣀⣐⣰⣑⣎⣆⣄⣌⣐⣐⣴⣂⣀⣀⣮⣐⣀⣀⣼⣀⣐⣠⣶⣐⣂⣀⣖⣃⣆│
│⣿⣿⣿⣿⠿⣿⣿⣿⡿⢿⣿⣿⣿⡿⣿⣿⡿⠿⣿⣿⣿⣿⣿⢿⠿⣿⣿⠿⣿⣿⡿⣿⡿⣿⡿⣿⡿⣿⡿⣿│
│⠁⠚⠀⠨⠈⠿⠇⠀⠛⢉⠅⠑⠈⠛⠰⠿⠃⠑⠍⠇⠅⠯⠍⠀⠀⣫⠈⠐⠑⠽⠑⠀⠃⠪⠄⠸⠠⠭⠁⠃│
-18.3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
└────────────────────────────────────────┘
⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀10 000⠀
So perhaps this pattern is a feature of Intel CPUs?