## Background

In a separate topic discussing why systematic errors appear when using `BenchmarkTools`, I ended up claiming that `BenchmarkTools`, which happens to be great for measuring benchmarks, could introduce improvements when it comes to comparing benchmarks.

Since that topic was resolved and I needed specific examples to convey what this improvement might look like, I ended my participation by suggesting a new topic on this matter, which seemed to spark the interest of at least @tisztamo and @Sukera. Here we go.

#### Next we will show:

- How using `BenchmarkTools` will fail to detect effects in one specific example.
- One general statistical test strategy to compare benchmarks.
- Conclusions.

## Intro

As it turns out, the in-development `BenchmarkTools` documentation already talks about potential sources of error. In particular, to avoid systematic errors it recommends caching tuning parameters across separate Julia sessions, and to deal with non-systematic machine/OS errors it recommends a list of low-level settings not for the faint of heart, nor for anybody without admin rights.

When looking for information on the Internet I actually found plenty dealing with the low-level side of things, which is important, but not so much dealing with the systematic side, which is statistically **critical**.

Incidentally, these very important warnings and others come at the very end of the `BenchmarkTools` manual, just like the warnings in the Book of Cagliostro… Strange.

## BenchmarkTools failed comparison

Admittedly, unless I go through every single piece of advice in the docs I cannot claim that `BenchmarkTools` produced a failed comparison. That’s why in this example we are going to imagine we work on a server with no admin rights to try anything low-level; we will, though, restart Julia sessions between benchmarks.

#### Functions to be compared

```
A(x) = for i in 1:101 x\x end
B(x) = for i in 1:100 x\x end
```

The matrix operation `x\x` is quite robust against compiler optimizations, so a benchmark ratio estimate close to the theoretical 1.01 is to be expected.
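Since the two functions are identical except for the loop count, the theoretical ratio is just the ratio of iteration counts; a quick sanity check:

```
iterations_A = 101   # loop count in A(x)
iterations_B = 100   # loop count in B(x)
expected_ratio = iterations_A / iterations_B  # 1.01
```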

#### BenchmarkTools results

```
# run in new Julia session
using BenchmarkTools, Random
Random.seed!(1)
n = 100
A(x) = for i in 1:101 x\x end
@benchmark A(x) setup = (x=rand(n,n)) evals = 1 samples = 10_000 seconds = 10_000
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 21.825 ms … 3.182 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 34.535 ms ┊ GC (median): 0.00%
Time (mean ± σ): 42.685 ms ± 58.542 ms ┊ GC (mean ± σ): 1.26% ± 2.56%
```

Now we restart Julia and run the benchmark again with `B(x)`:

```
# run in new Julia session
using BenchmarkTools, Random
Random.seed!(1)
n = 100
B(x) = for i in 1:100 x\x end
@benchmark B(x) setup = (x=rand(n,n)) evals = 1 samples = 10_000 seconds = 10_000
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 21.867 ms … 439.327 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 34.535 ms ┊ GC (median): 0.00%
Time (mean ± σ): 36.388 ms ± 14.340 ms ┊ GC (mean ± σ): 1.46% ± 2.56%
```

The estimated ratio for min, median and mean are:

```
* min: 21.825 / 21.867 = 0.9980792975716832
* median: 34.535 / 34.535 = 1.0
* mean: 42.685 / 36.388 = 1.1730515554578433
```

The manual next advises tools like `judge` and `ratio` to compare benchmark results, but there is no point using them since all comparisons would be quite wrong in this case.

**Let’s imagine we need to publish these results in a scientific journal.** How would we go about it? The real ratio between A and B is about 1.01, however:

- The `min` ratio says B is **slower** than A
- The `median` ratio says they’re **identical**
- The `mean` ratio says B is about 17% **faster** than A!

Were 10,000 samples enough? How could we tell anyway? What if we publish that they’re identical and the next day other researchers call out our results, showing unfair favoritism towards A? Should we repeat the experiment? How many times? How do we account for all the repetitions we are not showing in our paper? Might our careers and credibility be on the line?
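We can at least ballpark the first question with a rough power calculation. This is a sketch under strong assumptions (approximately normal noise and a two-sample z-approximation; benchmark timings are notoriously non-normal, so treat it only as an order-of-magnitude estimate), plugging in the standard deviation reported for B above:

```
σ = 14.34e6          # ns: B's reported standard deviation (≈ 14.340 ms)
Δ = 0.01 * 34.5e6    # ns: a 1% effect on the ≈ 34.5 ms median
zα, zβ = 1.96, 0.84  # normal quantiles for α = 0.05 (two-sided) and 80% power
n = 2 * ((zα + zβ) * σ / Δ)^2  # samples needed per group: ≈ 27,000
```

So under these assumptions, 10,000 samples per function would not have been enough to reliably detect a 1% effect at this noise level.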

## Statistical Test Strategy

Interestingly, despite the fact that most of the information I’ve found about benchmarking on the Internet revolves around reducing noise, from a statistical point of view **noise reduction is not that relevant**: more noise only means more runs are needed to detect an effect, not that the effect cannot be detected.

Conversely, an unaccounted-for systematic error can not only mask an effect but even produce an effect in the opposite direction that no number of runs will make go away. It is this systematic error we will deal with next, along with the machine/OS noise, **even if we don’t have admin rights.**
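A toy simulation (made-up numbers, not real timings) illustrates the point: noise washes out with enough samples, while a systematic offset does not:

```
using Random, Statistics
Random.seed!(0)

# Toy model: A truly costs 101 units, B costs 100, both with heavy noise.
a = 101 .+ 10 .* randn(10_000)
b = 100 .+ 10 .* randn(10_000)
mean(a) / mean(b)         # ≈ 1.01: the noise averages out with enough runs

# Now give B a small systematic advantage (e.g. it always runs with warmer caches):
b_biased = b .+ 2.0
mean(a) / mean(b_biased)  # ≈ 0.99: B looks faster, no matter how many runs we add
```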

#### Functions to be compared

```
A(x) = for i in 1:101 x\x end
B(x) = for i in 1:100 x\x end
```

The order in which functions are loaded in Julia already causes a systematic error, hence the advice in the `BenchmarkTools` manual to use a new Julia session per benchmark. However, there might be ways in which we can force Julia not to introduce any systematic error when executing functions. Let’s consider the following code:

```
using Pkg, Random, BenchmarkTools, Statistics
Pkg.develop("BenchmarkTools")        # v1.1.2 with unsorted times results
fa = x -> for i in 1:101 x\x end     # Anonymous A(x) function
fb = x -> for i in 1:100 x\x end     # Anonymous B(x) function

# A(x) vs B(x) vs Control function
function vs(a, b, x)
    t = pop!(abc)
    f = (t == "c") ? (x -> 0) : (t == "a" ? a : b)
    push!(abcs, t)
    f(x)
end

global abc   # a,b,c run sequence
global abcs  # a,b,c warmup-shifted run sequence
abct = []    # collected a/b median ratios, one per run
for i in 1:100
    abcs = []
    abc = repeat(["c","b","a"], 300+1)  # padded to 300+1 repetitions to allow for warmup runs
    Random.seed!(1)
    f = @benchmark vs(fa, fb, x) setup = (x = rand(100,100)) evals = 1 samples = 300 seconds = 300
    abcs = abcs[length(abcs)-length(f.times)+1:end]  # warmup shift
    a = f.times[findall(==("a"), abcs)]
    b = f.times[findall(==("b"), abcs)]
    c = f.times[findall(==("c"), abcs)]
    m = min(length(a), length(b), length(c))
    push!(abct, median(a[1:m] - c[1:m]) / median(b[1:m] - c[1:m]))
end
```

The overall strategy is:

1. Turn the functions A and B we want to compare into anonymous functions.
2. Create a `versus` function that sequentially runs the anonymous functions for A and B, plus a Control function that returns 0.
3. Create a population of runs.
4. Do statistics on that population.

In this example, steps 1. and 2. deal with systematic errors, steps 3. and 4. deal with machine/OS noise.

The example above runs 10,000 samples each for A, B and Control, summarized into 100 median ratios. If we now run a statistical test on the collected ratios we get:

```
julia> using HypothesisTests, Statistics
julia> HypothesisTests.OneSampleTTest(mean(abct),std(abct),length(abct),1.0)
One sample t-test
-----------------
Population details:
parameter of interest: Mean
value under h_0: 1.0
point estimate: 1.01123
95% confidence interval: (1.0094, 1.0131)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: <1e-20
Details:
number of observations: 100
t-statistic: 12.122971746977269
degrees of freedom: 99
empirical standard error: 0.0009261004013944243
```

These results tell us that the point estimate is 1.01123 (the expected one being 1.01); they also tell us, with 95% confidence, that the real ratio lies within (1.0094, 1.0131), and that we can be statistically certain that B is faster than A (p-value ≈ 0).
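To see where those numbers come from, the t-test quantities can be reproduced by hand from the sample mean and standard deviation. This is a sketch with made-up ratios standing in for `abct` (the 1.984 below is the two-sided 95% t quantile for 99 degrees of freedom):

```
using Statistics, Random
Random.seed!(1)

abct = 1.011 .+ 0.009 .* randn(100)   # stand-in ratios, NOT the original data

m  = mean(abct)
se = std(abct) / sqrt(length(abct))   # empirical standard error of the mean
t  = (m - 1.0) / se                   # t-statistic against H0: ratio == 1
ci = (m - 1.984se, m + 1.984se)       # 95% confidence interval for the mean ratio
```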

We answered which algorithm is faster, by how much and how certain we are on those results. **Now we can publish these results with peace of mind.**

## Conclusions

Ernest Rutherford once said:

> If your experiment needs statistics, you ought to have done a better experiment.

Little did he know at the time that probability might be an inherent trait of reality. Still, Rutherford’s attitude remains prevalent in some circles: “Let’s get more data”, “Let’s run more trials”, “Let’s engineer less noise”, etc., all with the goal of achieving results so accurate that statistics are not needed anymore… until they are, and hopefully I have made a case for it.