In a separate topic discussing why systematic errors appear when using BenchmarkTools, I ended up claiming that BenchmarkTools, which happens to be great at *measuring* benchmarks, could be improved when it comes to *comparing* benchmarks.

Since that topic was resolved and I needed specific examples to convey what this improvement might look like, I ended my participation by suggesting a new topic on the matter, which seemed to spark the interest of at least @tisztamo and @Sukera. Here we go.
- How using `BenchmarkTools` will fail to detect effects in one specific example.
- One general statistical test strategy to compare benchmarks.
As it turns out, the `BenchmarkTools` documentation in development already discusses potential sources of error. In particular, to address systematic errors it recommends caching parameters and running benchmarks in separate Julia sessions, and to deal with non-systematic machine/OS errors it recommends a list of low-level settings not for the faint of heart, or for anybody without admin rights.
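For reference, the parameter-caching workflow looks roughly like this (a sketch, assuming BenchmarkTools is installed; `params.json` and the suite contents are made-up placeholders):

```julia
using BenchmarkTools

# A suite with one benchmark standing in for whatever we want to measure.
suite = BenchmarkGroup()
suite["solve"] = @benchmarkable x \ x setup = (x = rand(100, 100))

# Tune once and save the resulting evals/samples parameters to disk...
tune!(suite)
BenchmarkTools.save("params.json", params(suite))

# ...then, in a fresh Julia session, reload them instead of re-tuning,
# so that every session benchmarks with identical parameters.
loadparams!(suite, BenchmarkTools.load("params.json")[1], :evals, :samples)
results = run(suite)
```

This way the tuning step, which is itself a source of run-to-run variation, is done once rather than once per session.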
When looking for information on the Internet I actually found plenty dealing with the low-level side of it, which is important, but not so much dealing with the systematic side of it, which is statistically critical.
Incidentally, these very important warnings, among others, come at the very end of the `BenchmarkTools` manual, just like the warnings in the Book of Cagliostro… Strange.
Admittedly, unless I go through every single piece of advice in the docs I cannot claim that `BenchmarkTools` produced a failed comparison. That's why in this example we are going to imagine we work on a server with no admin rights to try anything low level; we will, though, restart Julia sessions between benchmarks.
```julia
A(x) = for i in 1:101 x\x end
B(x) = for i in 1:100 x\x end
```
The matrix operation `x\x` is quite robust against compiler optimizations, and a benchmark ratio estimate close to the theoretical 1.01 is to be expected.
```julia
# run in new Julia session
using BenchmarkTools, Random
Random.seed!(1)
n = 100
A(x) = for i in 1:101 x\x end
@benchmark A(x) setup = (x = rand(n, n)) evals = 1 samples = 10_000 seconds = 10_000
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  21.825 ms … 3.182 s    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.535 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.685 ms ± 58.542 ms  ┊ GC (mean ± σ):  1.26% ± 2.56%
```
Now we restart Julia and do the same with `B`:
```julia
# run in new Julia session
using BenchmarkTools, Random
Random.seed!(1)
n = 100
B(x) = for i in 1:100 x\x end
@benchmark B(x) setup = (x = rand(n, n)) evals = 1 samples = 10_000 seconds = 10_000
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  21.867 ms … 439.327 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.535 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.388 ms ±  14.340 ms  ┊ GC (mean ± σ):  1.46% ± 2.56%
```
The estimated ratios for min, median, and mean are:

* min: 21.825 / 21.867 = 0.9980792975716832
* median: 34.535 / 34.535 = 1.0
* mean: 42.685 / 36.388 = 1.1730515554578433
The manual next suggests tools like `ratio` to compare benchmark results, but there is no point: all comparisons will be quite wrong in this case.
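For completeness, the comparison tools work roughly like this (a sketch with a trivial stand-in workload, not the A/B trials above):

```julia
using BenchmarkTools

# Two trials standing in for the A and B runs (placeholder workload).
t_a = run(@benchmarkable sum(x) setup = (x = rand(1000)), evals = 1, samples = 100)
t_b = run(@benchmarkable sum(x) setup = (x = rand(1000)), evals = 1, samples = 100)

# `ratio` compares two point estimates of the same statistic; `judge` then
# classifies the ratio as improvement, regression, or invariant within a
# tolerance. Note that both inherit whatever bias the estimates carry.
r = ratio(minimum(t_a), minimum(t_b))
j = judge(minimum(t_a), minimum(t_b))
```

These are fine tools once the estimates themselves are trustworthy, which is exactly what fails here.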
Let’s imagine we need to publish these results in a scientific journal; how would we go about it? The real ratio between A and B is about 1.01, however:
- the `min` ratio says B is slower than A
- the `median` ratio says they’re identical
- the `mean` ratio says B is about 17% faster than A!
Were 10,000 samples enough? How could we tell anyway? What if we publish that they’re identical and the next day other researchers call out our results, showing unfair favoritism for A? Should we repeat the experiment? How many times? How do we account for all the repetitions we are not showing in our paper? Might our careers and credibility be on the line?
Interestingly, despite the fact that most of the information I’ve found about benchmarking on the Internet revolves around reducing noise, from a statistical point of view noise reduction is not that relevant: more noise only means more runs are needed to detect an effect, not that the effect cannot be detected.
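This point can be illustrated with a toy simulation (pure Julia; the effect size and noise levels are made-up numbers, not real benchmark data):

```julia
using Random, Statistics

Random.seed!(42)

# Toy model: algorithm A costs 1.01 time units per run, B costs 1.00,
# plus zero-mean Gaussian noise of a given level.
const EFFECT_A, EFFECT_B = 1.01, 1.00

# One experiment: do n noisy runs of each and check whether the observed
# ordering of the means matches the true ordering (A slower than B).
function detected(n, noise)
    a = EFFECT_A .+ noise .* randn(n)
    b = EFFECT_B .+ noise .* randn(n)
    return mean(a) > mean(b)
end

# Detection rate over 500 repetitions at each setting.
rate(n, noise) = mean(detected(n, noise) for _ in 1:500)

low_noise_small_n  = rate(100, 0.05)     # small noise, few runs
high_noise_small_n = rate(100, 0.50)     # 10x the noise, few runs
high_noise_many_n  = rate(10_000, 0.50)  # 10x the noise, 100x the runs
```

Increasing the noise tenfold drops the detection rate, but increasing the number of runs brings it back: noise costs samples, it does not hide the effect forever. A systematic bias, by contrast, would shift `EFFECT_A` or `EFFECT_B` themselves, and no sample size fixes that.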
Conversely, an unaccounted systematic error can not only mask an effect but even produce an effect in the opposite direction that no number of runs will make go away. It is this systematic error we will deal with next, as well as the machine/OS noise, even though we don’t have admin rights.
```julia
A(x) = for i in 1:101 x\x end
B(x) = for i in 1:100 x\x end
```
The order in which functions are loaded in Julia already causes a systematic error, hence the advice in the `BenchmarkTools` manual to use a new Julia session per benchmark. However, there might be ways in which we can force Julia not to introduce any systematic error when executing functions. Let’s consider the following code:
```julia
using Pkg, Random, Statistics, BenchmarkTools
Pkg.develop("BenchmarkTools") # v1.1.2 with unsorted times results

fa = x -> for i in 1:101 x\x end # anonymous A(x) function
fb = x -> for i in 1:100 x\x end # anonymous B(x) function

# A(x) vs B(x) vs control function
function vs(a, b, x)
    t = pop!(abc)
    f = (t == "c") ? (x -> 0) : (t == "a" ? a : b)
    push!(abcs, t)
    f(x)
end

global abc    # a,b,c run sequence
global abcs   # a,b,c warmup-shifted run sequence
abct = []     # abc median-ratio estimates

for i in 1:100
    global abc, abcs # needed when run as a script rather than in the REPL
    abcs = []
    abc = repeat(["c", "b", "a"], 300 + 1) # pad with 300+1 to allow for warmup runs
    Random.seed!(1)
    f = @benchmark vs(fa, fb, x) setup = (x = rand(100, 100)) evals = 1 samples = 300 seconds = 300
    abcs = abcs[length(abcs)-length(f.times)+1:end] # warmup shift
    a = f.times[findall(==("a"), abcs)]
    b = f.times[findall(==("b"), abcs)]
    c = f.times[findall(==("c"), abcs)]
    m = min(length(a), length(b), length(c))
    push!(abct, median(a[1:m] - c[1:m]) / median(b[1:m] - c[1:m]))
end
```
The overall strategy is:

1. Turn the functions A and B we want to compare into anonymous functions.
2. Create a `vs` (versus) function that sequentially runs the anonymous functions for A and B plus a control function that returns 0.
3. Create a population of runs.
4. Do statistics on such a population.
In this example, steps 1. and 2. deal with systematic errors, steps 3. and 4. deal with machine/OS noise.
The example above performs 100 runs of 300 samples each, roughly 100 samples per A, B, and control C per run, each run summarized as one median ratio. If we now apply a statistical test to the collected median ratios we get:
```julia
julia> using HypothesisTests, Statistics

julia> HypothesisTests.OneSampleTTest(mean(abct), std(abct), length(abct), 1.0)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         1.0
    point estimate:          1.01123
    95% confidence interval: (1.0094, 1.0131)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-20

Details:
    number of observations:   100
    t-statistic:              12.122971746977269
    degrees of freedom:       99
    empirical standard error: 0.0009261004013944243
```
These results tell us that the point estimate is 1.01123 (the expected value is 1.01); they also tell us with 95% confidence that the real ratio lies within (1.0094, 1.0131), and that we can be statistically certain that B is faster than A (p-value ≈ 0).
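What `OneSampleTTest` computes can be checked by hand; a minimal sketch with made-up ratio data standing in for the actual `abct` vector:

```julia
using Random, Statistics

Random.seed!(1)

# Made-up stand-in for the 100 collected median ratios,
# centered near the true 1.01 with a small spread.
ratios = 1.01 .+ 0.009 .* randn(100)

n   = length(ratios)
mu0 = 1.0                        # null hypothesis: A and B take the same time
se  = std(ratios) / sqrt(n)      # empirical standard error
t   = (mean(ratios) - mu0) / se  # t-statistic with n - 1 degrees of freedom

# 95% confidence interval (normal approximation; for n = 100 the exact
# t quantile, about 1.984, is very close to 1.96).
ci = (mean(ratios) - 1.96 * se, mean(ratios) + 1.96 * se)
```

If `t` is large and the whole interval sits above 1.0, we reject the hypothesis that A and B are equally fast, which is exactly what the test output above reports.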
We answered which algorithm is faster, by how much, and how certain we are of those results. Now we can publish them with peace of mind.
Ernest Rutherford once said:

> If your experiment needs statistics, you ought to have done a better experiment.
Little did he know at the time that probability might be an inherent trait of reality. Still, Rutherford’s attitude remains prevalent in some circles: “Let’s get more data”, “Let’s run more trials”, “Let’s engineer less noise”, etc. All with the goal of achieving results so accurate that statistics are not needed anymore… until they are, and hopefully I have made a case for it.