Benchmark Tests: Improvements for BenchmarkTools


In a separate topic discussing why systematic errors appear when using BenchmarkTools, I ended up claiming that BenchmarkTools, which is great for measuring benchmarks, could be improved when it comes to comparing benchmarks.

Since that topic was resolved and I needed specific examples to convey what this improvement might look like, I ended my participation by suggesting a new topic on the matter, which seemed to spark the interest of at least @tisztamo and @Sukera. Here we go.

Next we will show:

  • How using BenchmarkTools will fail to detect effects in one specific example.
  • One general statistical test strategy to compare benchmarks.
  • Conclusions


As it turns out, the development version of the BenchmarkTools documentation already discusses potential sources of error. In particular, to address systematic errors it recommends caching parameters across separate Julia sessions, and to deal with non-systematic machine/OS errors it recommends a list of low-level settings not for the faint of heart, or for anybody without admin rights.

When looking for information on the Internet I actually found a lot about the low-level side, which is important, but not much about the systematic side, which is statistically critical.

Incidentally, these very important warnings and others come at the very end in the BenchmarkTools manual, just like the warnings in the Book of Cagliostro… Strange.

BenchmarkTools failed comparison

Admittedly, unless I go through every single piece of advice in the docs I cannot claim that BenchmarkTools produced a failed comparison. That’s why in this example we are going to imagine we work on a server with no admin rights to try anything low level; we will, though, restart Julia sessions between benchmarks.

Functions to be compared

A(x) = for i in 1:101 x\x end
B(x) = for i in 1:100 x\x end

The matrix operation x\x is quite robust against compiler optimizations, so a benchmark ratio estimate close to the theoretical 101/100 = 1.01 is to be expected.

BenchmarkTools results

# run in new Julia session
using BenchmarkTools, Random
n = 100
A(x) = for i in 1:101 x\x end
@benchmark A(x) setup = (x=rand(n,n)) evals = 1 samples = 10_000 seconds = 10_000

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  21.825 ms …   3.182 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.535 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.685 ms ± 58.542 ms  ┊ GC (mean ± σ):  1.26% ± 2.56%

Now we restart Julia and again with B(x)

# run in new Julia session
using BenchmarkTools, Random
n = 100
B(x) = for i in 1:100 x\x end
@benchmark B(x) setup = (x=rand(n,n)) evals = 1 samples = 10_000 seconds = 10_000

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  21.867 ms … 439.327 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.535 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.388 ms ±  14.340 ms  ┊ GC (mean ± σ):  1.46% ± 2.56%

The estimated ratios for min, median and mean are:

* min:    21.825 / 21.867 = 0.9980792975716832
* median: 34.535 / 34.535 = 1.0
* mean:   42.685 / 36.388 = 1.1730515554578433
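For reference, the same ratios computed directly in Julia (values in ms taken from the two Trial printouts above):

```julia
# Ratios computed from the summary values printed above (milliseconds)
min_ratio    = 21.825 / 21.867   # ≈ 0.998 → suggests B is slightly slower
median_ratio = 34.535 / 34.535   # = 1.0   → suggests A and B are identical
mean_ratio   = 42.685 / 36.388   # ≈ 1.173 → suggests B is ~17% faster
```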

The manual next suggests tools like judge and ratio to compare benchmark results, but there is no point: all comparisons would be quite wrong in this case.

Let’s imagine we need to publish these results in a scientific journal, how would we go about it? The real ratio between A and B is about 1.01, however:

  • The min ratio says B is slower than A
  • The median ratio says that they’re identical
  • The mean ratio says B is about 17% faster than A!

Were 10,000 samples enough? How could we tell anyway? What if we publish that they’re identical and the next day other researchers call out our results, showing unfair favoritism for A? Should we repeat the experiment? How many times? How do we account for all the repetitions we don’t show in our paper? Might our careers and credibility be on the line?

Statistical Test Strategy

Interestingly, although most of the information I’ve found about benchmarking on the Internet revolves around reducing noise, from a statistical point of view noise reduction is not that relevant: more noise only means more runs are needed to detect an effect, not that the effect cannot be detected.

Conversely, an unaccounted-for systematic error can not only mask an effect but even produce an effect in the opposite direction, one that no number of runs will make go away. It is this systematic error we will deal with next, along with machine/OS noise, even without admin rights.

Functions to be compared

A(x) = for i in 1:101 x\x end
B(x) = for i in 1:100 x\x end

The order in which functions are loaded in Julia already causes a systematic error, hence the advice in the BenchmarkTools manual to use a new Julia session per benchmark. However, there might be ways to force Julia not to introduce any systematic error when executing the functions. Let’s consider the following code:

using Pkg, Random, BenchmarkTools 
Pkg.develop("BenchmarkTools") # v1.1.2 with unsorted times results

fa = x -> for i in 1:101 x\x end # Anonymous A(x) function
fb = x -> for i in 1:100 x\x end # Anonymous B(x) function

# A(x) vs B(x) vs Control function
function vs(a, b, x)
    t = pop!(abc)                                  # take the next label in the run sequence
    push!(abcs, t)                                 # record which label actually ran
    f = (t == "c") ? (x -> 0) : (t == "a" ? a : b)
    f(x)
end

global abc    # a,b,c run sequence
global abcs   # a,b,c warmup-shifted run sequence
abct = []     # collected median ratios

using Statistics # for median

for i in 1:100
    global abcs = []
    global abc  = repeat(["c","b","a"], 300+1)  # padded to 300+1 to allow for warmup runs
    f = @benchmark vs(fa,fb,x) setup = (x=rand(100,100)) evals = 1 samples = 300 seconds = 300

    global abcs = abcs[length(abcs)-length(f.times)+1:end] # warmup shift
    a = f.times[findall(==("a"), abcs)]
    b = f.times[findall(==("b"), abcs)]
    c = f.times[findall(==("c"), abcs)]
    m = min(length(a), length(b), length(c))

    push!(abct, median(a[1:m] - c[1:m]) / median(b[1:m] - c[1:m]))
end

The overall strategy is:

  1. Turn the functions A and B we want to compare into anonymous functions.
  2. Create a versus function that sequentially runs the anonymous functions for A and B plus a Control function that returns 0.
  3. Create a population of runs.
  4. Do statistics on such population.

In this example, steps 1. and 2. deal with systematic errors, steps 3. and 4. deal with machine/OS noise.

The example above collects 30,000 samples over 100 runs (10,000 each for A, B, and the control C), summarized as 100 median ratios. If we now run a statistical test on those ratios we have:

julia> using HypothesisTests, Statistics

julia> HypothesisTests.OneSampleTTest(mean(abct),std(abct),length(abct),1.0)
One sample t-test
Population details:
    parameter of interest:   Mean
    value under h_0:         1.0
    point estimate:          1.01123
    95% confidence interval: (1.0094, 1.0131)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-20

    number of observations:   100
    t-statistic:              12.122971746977269
    degrees of freedom:       99
    empirical standard error: 0.0009261004013944243

These results tell us that the point estimate is 1.01123 (the expected value is 1.01); they also tell us, with 95% confidence, that the real ratio lies within (1.0094, 1.0131), and that we can be statistically certain that B is faster than A (p-value ≈ 0).
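For readers who want to see where those numbers come from, the same interval can be computed by hand. Since only the summary is shown above, this sketch fabricates 100 ratios with the reported mean and spread, and hard-codes the t quantile for 99 degrees of freedom:

```julia
using Random, Statistics
Random.seed!(0)

# Fabricated stand-in for the 100 median ratios collected in abct above
abct = 1.01123 .+ 0.009261 .* randn(100)

m  = mean(abct)
se = std(abct) / sqrt(length(abct))   # empirical standard error
t97_5 = 1.984                         # two-sided 95% quantile of t(99)
ci = (m - t97_5 * se, m + t97_5 * se) # 95% confidence interval
tstat = (m - 1.0) / se                # t-statistic against H0: ratio = 1
```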

We answered which algorithm is faster, by how much, and how certain we are of those results. Now we can publish these results with peace of mind.


Ernest Rutherford once said

If your experiment needs statistics, you ought to have done a better experiment

Little did he know at the time that probability might be an inherent trait of reality. Still, Rutherford’s attitude is prevalent in some circles: “Let’s get more data”, “Let’s run more trials”, “Let’s engineer less noise”, etc., all with the goal of achieving results so accurate that statistics are no longer needed… until they are, and hopefully I have made a case for that.


Thanks for the great writeup!

Having confidence intervals easily accessible on my laptop, without needing to break or reconfigure it, would be a game changer for me!

A few notes:

  • It seems to me that in your version there are many more evaluations in total, so stating that @benchmark was not able to compare seems a bit misleading.

  • I think that your implementation introduces some systematic error in vs: it will allocate in a deterministic and non-uniform way (possibly solvable by a size hint on abcs), and the ? operator with fixed branches may also be problematic (I did not check it)
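A possible shape of that size hint (purely hypothetical, I did not check whether it removes the effect):

```julia
# Hypothetical tweak to the code above: pre-allocate the label arrays so
# that push!/pop! inside the benchmark loop never triggers a resize
abcs = sizehint!(String[], 301 * 3)
abc  = repeat(["c", "b", "a"], 301)
```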

For many! :slight_smile:

The BenchmarkTools version is using 10_000 samples for A and 10_000 for B

The statistical version is using 300 samples (100 A, 100 B, 100 C) 100 times, which makes 10_000 samples for A, 10_000 for B, and 10_000 for C (the control group).
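Spelled out:

```julia
# Sample accounting for the statistical version
runs, samples_per_run = 100, 300
total = runs * samples_per_run   # 30_000 samples overall
per_label = total ÷ 3            # 10_000 each for A, B, and the control C
```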

To check that no systematic error was introduced, I compared B vs. B multiple times and could not find any statistically significant difference between the two when using 10_000 samples.

However, I imagine that experts in low level features may improve a lot upon what is just meant to be a proof of concept.


True, sorry!

1 Like

First of all, thank you for taking the time to engage with this topic! I really appreciate it and hope you’re not growing tired of me :slight_smile:

Here’s what I’ve noticed in your post/methodology. Let’s begin with a minor issue:

Those numbers are not correct - you’ve only taken what the benchmark displayed for convenience (values truncated for display) and treated them as the true values. If you take the true numbers reported by a single one of your benchmark runs (i.e. evals=1), you’ll get something like the following. A.jl and B.jl are basically the same, except that A runs one more iteration.

Files A.jl and B.jl
# A.jl
using BenchmarkTools, Random
n = 100
A(x) = for _ in 1:101 x\x end
ba = @benchmark A(x) setup=(x=rand(n,n)) evals=1 samples=10_000 seconds=10_000
write("a.out", ba.times)

# B.jl
using BenchmarkTools, Random
n = 100
B(x) = for _ in 1:100 x\x end
bb = @benchmark B(x) setup=(x=rand(n,n)) evals=1 samples=10_000 seconds=10_000
write("b.out", bb.times)
# New session
julia> include("A.jl")                                                     
julia> ba                                                                  
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.974 ms …   5.111 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.601 ms              ┊ GC (median):    0.00%        
 Time  (mean ± σ):   36.863 ms ± 62.174 ms  ┊ GC (mean ± σ):  1.63% ± 2.94%
  ▄▅█ ▁▁                                                                   
  ██████▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂            
  27 ms           Histogram: frequency by time        74.9 ms <            
 Memory estimate: 15.51 MiB, allocs estimate: 505.                         

Now for B:

# New session
julia> include("B.jl")                                                      
julia> bb                                                                   
BenchmarkTools.Trial: 10000 samples with 1 evaluation.                      
 Range (min … max):  27.788 ms … 217.745 ms  ┊ GC (min … max): 0.00% … 0.38%
 Time  (median):     29.346 ms               ┊ GC (median):    0.00%        
 Time  (mean ± σ):   33.293 ms ±  10.183 ms  ┊ GC (mean ± σ):  1.48% ± 2.87%
  █▇▆▅▄▃▃▃▂▂▁▁▁▂▁▁  ▂▁▂                                        ▂            
  ████████████████████████▇███▇██▇▇▇▆▇▇▅▆▅▅▅▅▄▄▄▃▄▄▅▄▅▆▇██▇▆▇▇ █            
  27.8 ms       Histogram: log(frequency) by time      75.7 ms <            
 Memory estimate: 15.35 MiB, allocs estimate: 500.                                                                                                                    

Now for the calculations:

# New session, load data from disk
julia> ba_times = reinterpret(Float64, read("a.out"));
julia> bb_times = reinterpret(Float64, read("b.out"));

julia> using Statistics                     
julia> minimum(ba_times) / minimum(bb_times)
julia> median(ba_times) / median(bb_times)  
julia> mean(ba_times) / mean(bb_times)      

As you can see, median(ba_times) / median(bb_times) already shows about the ratio we’d expect in the first place. Testing that this is statistically significant is hard, though: the distribution of raw running times is heavily skewed, with most of the mass near the minimum and a long tail to the right, simply because noise introduced by the environment is always additive (there’s no magic going on that can make a program go faster than its minimum possible speed). This is also the reason why neither minimum nor mean shows the same value - I’d only expect them to have the same ratio if the samples were normally distributed, which they are not.
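The skew is easy to see with synthetic numbers (an assumed shifted-exponential model of timing noise, not real measurements): one-sided additive noise leaves the minimum near the true time but drags the median and, even more, the mean upward:

```julia
using Random, Statistics
Random.seed!(42)

true_time = 30.0                    # hypothetical noise-free run time (ms)
noise = 5.0 .* randexp(10_000)      # one-sided environmental noise
samples = true_time .+ noise

minimum(samples)   # close to 30.0
median(samples)    # ≈ 30 + 5·ln(2) ≈ 33.5
mean(samples)      # ≈ 30 + 5 = 35
```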

As it turns out, we don’t need to change a whole lot to get what we want (a normal distribution of means we can use with standard statistical tests, like the one you’ve used - I’ve upped the allowed execution time because otherwise not all samples would complete on my machine):

julia> ba = @benchmark fa(x) setup=(x=rand(100,100)) evals=100 samples=100 seconds=400
BenchmarkTools.Trial: 100 samples with 100 evaluations.                               
 Range (min … max):  29.955 ms …  38.831 ms  ┊ GC (min … max): 1.32% … 1.00%          
 Time  (median):     30.535 ms               ┊ GC (median):    1.26%                  
 Time  (mean ± σ):   30.728 ms ± 962.884 μs  ┊ GC (mean ± σ):  1.26% ± 0.05%          
     ▂  ▂▁█ ▄   ▅  ▁                                                                  
  ▃▆▃█▆▅███▆█▆▆▆█▆██▅▃▃█▁▅▆▃▃▁▃▃▁▁▁▁▃▁▁▁▁▃▁▁▁▅▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃ ▃                      
  30 ms           Histogram: frequency by time         32.9 ms <                      
 Memory estimate: 15.51 MiB, allocs estimate: 505.                                    

julia> bb = @benchmark fb(x) setup=(x=rand(100,100)) evals=100 samples=100 seconds=400
BenchmarkTools.Trial: 100 samples with 100 evaluations.                    
 Range (min … max):  29.452 ms … 42.357 ms  ┊ GC (min … max): 1.26% … 0.90%
 Time  (median):     29.996 ms              ┊ GC (median):    1.27%        
 Time  (mean ± σ):   30.242 ms ±  1.328 ms  ┊ GC (mean ± σ):  1.26% ± 0.05%
     ▃ ▃▇▇▂ █                                                              
  ▅▆▇█▆██████▆█▃▆▇▇▆▃▁▃▅▁▁▁▁▃▁▁▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃            
  29.5 ms         Histogram: frequency by time        33.3 ms <            
 Memory estimate: 15.35 MiB, allocs estimate: 500.                         

Now for testing the ratio:

julia> b_ratio = ba.times ./ bb.times;                                                  
julia> HypothesisTests.OneSampleTTest(mean(b_ratio), std(b_ratio), length(b_ratio), 1.0)
One sample t-test                                                                       
Population details:                                                                     
    parameter of interest:   Mean                                                       
    value under h_0:         1.0                                                        
    point estimate:          1.01649                                                    
    95% confidence interval: (1.0144, 1.0186)                                           
Test summary:                                                                           
    outcome with 95% confidence: reject h_0                                             
    two-sided p-value:           <1e-27                                                 
    number of observations:   100                                                       
    t-statistic:              15.293253995093568                                        
    degrees of freedom:       99                                                        
    empirical standard error: 0.0010783588270127007                                     

We also get ~1.01, how about that! The reason is pretty simple: using evals=100 means “for each sample, evaluate the given code 100 times and take the mean as the sample value”. In effect, the ratio we’ve taken here is already a ratio of means. This is the same kind of dance you’ve done explicitly, but this time we’re using the existing tool to its fullest extent, saving us from writing it ourselves.
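To see why a ratio of means is amenable to a t-test, here’s a quick sketch on synthetic numbers (shifted-exponential stand-ins for timings, not real measurements): per-sample means of 100 skewed evaluations, which is what evals=100 reports, have the √100-times smaller spread the CLT predicts:

```julia
using Random, Statistics
Random.seed!(1)

evaluation() = 30.0 + 5.0 * randexp()           # one skewed "timing"
sample() = mean(evaluation() for _ in 1:100)    # what evals=100 reports per sample
means = [sample() for _ in 1:1_000]

std(means)   # ≈ 5 / sqrt(100) = 0.5, as the CLT predicts
```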

One possible counterargument to what I’ve done here is “Now you’re reusing x for all evaluations! Doesn’t that go against what you’ve said yourself?” and you’d be right! However, since it shows the same result you got, I don’t think reusing the data invalidates the conclusion here. I think it’s only possible to do that in the first place because the function in question does essentially the same work over and over again, artificially inflating the running time by running for one iteration more. The cache is already going to be hot during the vast majority of evaluation time of the function, no matter if we reuse the input variable or not. For other functions where that’s not the case, we have to control for remnants of other invocations, whereas here for this function it doesn’t seem to matter. That being said, one way around this problem is to allocate the inputs as part of the function, measure how long one allocation of inputs takes and tell BenchmarkTools about that overhead in the form of the overhead parameter:

julia> bc = @benchmark (x = rand(100,100); A(x)) evals=100 samples=100 seconds=400 overhead=11200.0
BenchmarkTools.Trial: 100 samples with 100 evaluations.                                            
 Range (min … max):  30.135 ms … 44.170 ms  ┊ GC (min … max): 1.11% … 0.80%                        
 Time  (median):     30.650 ms              ┊ GC (median):    1.13%                                
 Time  (mean ± σ):   30.910 ms ±  1.424 ms  ┊ GC (mean ± σ):  1.11% ± 0.04%                        
       ▂█▅▄  ▅▃                                                                                    
  ▆▃▄▆▅████▆▆██▅▃▇▁▄▃▃▃▁▁▃▃▁▃▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃                                    
  30.1 ms         Histogram: frequency by time        33.8 ms <                                    
 Memory estimate: 15.58 MiB, allocs estimate: 507.                                                 

You’ll notice that the timings are awfully close to the timings we got earlier (and in this case the overhead is such a small fraction of the real execution time that it comfortably falls within the margin of error).

If “doing it right” to get statistically significant results is this easy, even with BenchmarkTools, why don’t we use evals=high_number all the time? Well, when benchmarking we don’t really care about estimating the mean of the statistical execution time in the limit. We mostly care about minimum and maximum times, as well as the variance, because what we ultimately want is an overall faster program (ideally using fewer resources). When using the function in the context of other code, we’ll most likely never get anywhere close to the mean execution time - we’re much more likely to be closer to the maximum execution time (if the input data is not cached) or the minimum time (if the data is cached). Luckily for us, since benchmark timings usually pile up near the minimum (like I said above, all noise is additive), the mean and median are very often very close to the minimum already! That’s also why above, when we use the real timings, we get very similar ratios of about ~1.016 (probably some overhead there due to my slow-ish machine) for all three metrics (even though they’re raw values, not means!).

Another reason (as I’ve mentioned in the other thread) why we don’t usually bother with more intricate statistical analysis is that such a statistic is really only relevant on the exact machine it was run on, under about the same load as the original run. This is because of a multitude of reasons, some of which are:

  1. The julia code may be compiled to a different machine code on your machine vs. mine (due to difference in CPU architecture, which we can’t really control by ourselves), but this doesn’t matter in the vast majority of cases (unless you’re optimizing machine code with e.g. LoopVectorization, but at that point it should be understood that the effects are going to be local to the CPU/architecture at hand and most likely will not generalize to arbitrary CPUs)
  2. We’re no absolute masters over our OS - on my machine (windows 10), there are often a bunch of different background tasks running (like anti virus, disk defragmentation, update installation…). If we try really hard to find a time when nothing else is run, our benchmark could be representative for an ideal environment on that specific machine - but comparing that with something someone else is running is only possible superficially (and there are no hard statistics we can use here because of the difference in environment, which induces a difference in population and on it goes).

There’s probably more I forgot, but these two are really the big reasons why no benchmark (not in julia, not in C, not in Rust, not anywhere) should be taken at face value. The speedup of one function vs. another is only real for you if you can replicate it on your machine. That’s not to say that they can’t inform your choices - if one algorithm is asymptotically better than another, you can of course expect it to be faster on your machine as well (and if it isn’t, you should probably ask why). Autovectorization also makes it trickier to take one benchmark from one machine and say “now it’ll also be faster on yours”. Sadly, this is an area where we can’t place confidence bars - either it vectorizes on your CPU or it doesn’t. From the benchmark itself, we can never know!

I feel like you’re misrepresenting my case here - I don’t think statistics are useless, far from it! I’ve just come to understand that it’s not always useful to apply them, simply because statistics often relies on the samples being drawn from the same distribution, which is nigh-impossible to achieve when comparing benchmarking runs from two distinct machines, even if the functions are exactly the same, due to the number of influences that have to be controlled for. That’s why I’m opposed to “throw more stats at it”, not because I think statistics is not useful. That’s also why most often we don’t claim “it’s faster” without also adding something like “on my machine” or “I see this effect”, because there is no absolute measure of faster in the first place.

1 Like

Not at all, I also want to thank you for thoroughly going through my post!

I totally disagree. I’m glad you repeated the calculations (which we cannot expect to always deliver the same results), because those are actually even further from the truth than the ones I shared.

  • Your minimum ratio says that B is around 3% slower → False, B is faster.
  • Your median ratio says that B is around 4% faster → B is just 1% faster; 4% is way, way off.
  • Your mean ratio says B is about 10% faster → Totally off.

So no, these results are not “about the ratio we’d expect in the first place”, and we would still have all the problems I mentioned if we ever considered publishing them.

This is not true. I am using the median not because I expect it to be normally distributed, but because I expect it to be robust to outliers. As a matter of fact, if I were to implement more powerful statistics for these examples I would consider the minimum tail distribution.

We don’t want or need a normal distribution: the Central Limit Theorem guarantees that the sample mean of most distributions is approximately normally distributed, whether or not the distributions themselves are normal.
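A quick synthetic illustration of that point: raw exponential draws are heavily skewed, yet means of 100 such draws are nearly symmetric:

```julia
using Random, Statistics
Random.seed!(3)

# Standardized third moment as a simple skewness estimate
skewness(v) = mean(((v .- mean(v)) ./ std(v)).^3)

raw   = randexp(10_000)                          # skewness ≈ 2
means = [mean(randexp(100)) for _ in 1:10_000]   # skewness ≈ 0.2

skewness(raw), skewness(means)
```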

1 Like

Finally! Glad to see you’re using statistics :slight_smile:

Not true: my example also accounts for control values, which improves the ratio estimation, and it also deals with systematic errors. Possibly that’s why in my example the real ratio is well within the confidence interval, while in yours it is well outside.

Yes! It does go against what you’ve said yourself… multiple times, actually! :smiley:

But no, it does not show the same results, and even if it did, in mathematics/logic we can derive truths from false statements; obtaining a true conclusion does not make the reasoning valid.

“What we want” depends on who you’re talking to. I believe the BenchmarkTools team is now mostly focused on batch processing groups of benchmarks, which makes sense as a priority since they use this to monitor Julia language core code.

However, researchers who want to publish results will have to jump through some of the hoops I showed above, or have journals rejecting their research.

Not true at all! There is a whole field of Non-Parametric Statistics, and if the data are time-dependent, another whole field!

Well, obviously, if the same benchmarks are run on different CPUs and one of those CPUs happens to use operations that benefit one algorithm over another… yeah.

Once we agree on one architecture, though, we still need to provide reliable answers about which algorithm runs faster, by how much, and how certain we are of those results, and we can only answer those questions reliably with statistics.

I’m just chiming in to say that I long for some real statistics in BenchmarkTools. First I thought that judge would be the function doing just that, but I was so disappointed when I found out it’s just doing eyeball statistics… I’m not a statistician, so I don’t really have an opinion on what should be tested and how; there are lots of tricky questions to answer (are we looking at a sample of means/medians? are the samples really independent? means of how many subsamples? what is the distribution? is it really normal? etc.). Nonetheless I really hope this discussion will result in some proper use of statistics within BenchmarkTools.

I guess this could even be hidden behind Requires? i.e. those who don’t need it can decide benchmarks by gut feeling; others could do using HypothesisTests before :wink:

1 Like

Thank you for replying! Please do so in a single post next time, I got a lot of emails yesterday about being mentioned and wondered what was going on :slight_smile: It honestly felt like you’ve written replies as you went and didn’t stop to consider my points, which I hope wasn’t the case!

julia> ba_times = reinterpret(Float64, read("a.out"));
julia> bb_times = reinterpret(Float64, read("b.out"));
julia> minimum(bb_times) > minimum(ba_times)          

My data disagrees :man_shrugging: That’s the exact same data I used for the comparisons in my last post by the way, so a.out is the data from the runs with 101 executions and b.out is the data from the runs with 100 executions.

They’re within 3%; for something as fluctuating as timings on a ~5-year-old laptop with thermal throttling, I’d call that pretty good and “about what I’d expect”. The “real” ratio of 1.01 only exists on an ideal computer, which I (sadly) don’t own; I’d pay good money for one though. I ran this exact benchmark (yours and mine) multiple times over the past few days and I’ve seen everything from exactly 1.01, to 1.1, to 0.99 from both of our “benchmarks”, with statistical analysis.

Now I’m confused - we do need something that’s normally distributed, else we can’t apply a t-test in the first place, right? Yes, CLT guarantees that the point estimated mean is normally distributed, but surely you’ve used this because it’s more convenient for applying a standard statistical test than using a more general/complex test, no?

I mean, I won’t fudge my data to show something that wasn’t there. That run really did have those timings on my machine and that’s the result I got. I don’t think this is a systematic error either - my laptop was running very hot for both ba as well as bb (and was probably thermally throttling), so I’d say it’s much more likely that due to the code that was run, A really was that much slower (Occam’s razor applies). After all, we will only get a perfect 1.01 on an ideal (or sufficiently perfect/calm) machine, which mine certainly isn’t.

I don’t think it’s fair to only quote the part where I said that “yes, this may go against what I said”, without also considering the part where I give an explanation for why that’s fine in this case.

…and that’s precisely why they have to understand the tool they’re using (BenchmarkTools), its limitations as well as its features to make their life easier (which is literally all I’m arguing here).

And are we all expected now to know about them in order to use BenchmarkTools? Don’t get me wrong, using those tools when you know them is fine. Requiring everyone else to also know & understand them is not. I also don’t expect everyone on this forum to have an understanding of the intricacies of modern computers, even though that’s literally what everyone here is using and I’ve spent far too much time trying to understand them (which honestly could have been spent better). They’re quite complicated, you know?

You’re still missing the point. Even on the same machine, with the same code, you can get very different results between executions. No amount of error bars or statistics is going to save you from running the code on a hot machine vs. a cold machine. Yes, we could run the code for ages to get rid of those noises as well, but now we’d be overfitting to that machine (which can result in other artefacts, like the 1.04 I posted above). At some point, it’s time to acknowledge that you can’t account for all variables and that getting an error bar on some timing is infeasible.

Finally, throughout all this, you’ve forgotten one crucial thing: you went in knowing what to expect. When actually running code, modifying it and checking if it’s faster, you don’t a priori know how much faster the code should be. You can make an educated guess based on what you think the computer is doing, but that doesn’t (and in the vast majority of cases won’t) necessarily translate to real-world gains. Moreover, you don’t have a magic “the speedup should be 1.01” number/statement you can throw into some statistical test to “prove” that something is off. You can only check what real timing difference exists and try to come up with explanations for why that is.

Moreover, as I keep telling you, your analysis is not representative of what people generally try to achieve with benchmarking. The purpose of benchmarking is finding out how fast a given code is in a typical use case of that piece of code, with typical data. If solving the exact same linear system 100 times in a row is your use case, great! But that’s not the kind of use case people generally try to use BenchmarkTools for.

I’m sorry to interrupt you right here, but I don’t think this is the place for BenchmarkTools to be in. Doing any sort of statistics on the raw timing data requires knowing what sort of distribution you’re expecting (or at least fitting one). Choosing the wrong test for the wrong distribution (which almost always assumes the distribution you’re testing is some standard, well-behaved distribution) can have catastrophic consequences. I don’t see how BenchmarkTools should be able to conjure up what the distribution of timing of your function A looks like, what the distribution for B looks like and on top of that know how to decide which of the two is more advantageous for your specific use case. Even assuming BenchmarkTools could figure these distributions out, how would it decide which part of the distribution you care about?

  • Do you care about the maximum?
    • i.e. the long tail, what people care about when writing web applications. Maybe this one function that’s slow in 5% of cases is slowing everything down for people that have bad internet because they’re the ones always hitting that slow code path.
  • Do you care about the minimum?
    • maybe because you have a cryptographic function that shouldn’t branch and thus should run in constant time, or maybe because what you’re timing should take at least n milliseconds and, if it doesn’t, something is seriously wrong in your code
  • Do you care about the minimum and maximum being close together, i.e. low variance?
    • Maybe you have a frame budget for rendering in a video game and you can only take between 3-4 milliseconds, where it’s not important how long exactly but it is important to know the ballpark pretty well because it makes frame time more predictable. In those cases, sometimes a slower function that has predictable execution time is more desirable than a fast function that is slow in 5% or even 1% of cases. If you have more than a few of those unpredictable beasts, you can say goodbye to smooth frames and welcome to stuttering and bad visuals.
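The tension between these three criteria can be sketched on toy data (the numbers below are hypothetical, standing in for the raw `times` vector of two benchmark trials):

```julia
using Statistics

# Hypothetical timing samples (milliseconds) for two competing functions.
times_a = [1.00, 1.01, 1.02, 1.03, 5.00]   # fast, but with a long tail
times_b = [1.50, 1.51, 1.52, 1.53, 1.54]   # slower, but predictable

minimum(times_a) < minimum(times_b)   # A wins on the best case...
maximum(times_a) > maximum(times_b)   # ...but loses on the worst case...
std(times_a)     > std(times_b)       # ...and on predictability
```

All three comparisons hold at once, so which function is “faster” depends entirely on which criterion matches your use case.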

These all have to be taken into consideration when benchmarking, and I don’t think BenchmarkTools is the place to put that sort of analysis code. Its job is to collect timing data in a predictable way, with the code you’ve given it - the analysis is always on the part of the analyzer (though a lot of times the small, pretty statistics that are printed are helpful enough for the use cases in question).

Purely by the nature of running the same code in quick succession over and over again on the same machine, I can be quite confident in saying: no. If you want them to be somewhat close to independent, let the system come to rest, restore a state of chaos and randomness in cache & memory, wait 15-20 minutes for thermals to cool down and only then draw your second sample. Then repeat the dance and draw your third. You’ll quickly find this exercise to be quite cumbersome.

I don’t like long answers, and I prefer to respond one comment at a time. I have considered your points; I just disagree with them. However, I am starting to have the same feeling about how much you are considering mine.

My laptop is seven years old and within 3% when the difference is 1% is way, way off

Well, in other posts you were calling my results wonky for using multiple evals and not knowing how to use BenchmarkTools, so I just thought it was funny that you resorted to “wonkiness” to make a point… I still think it is funny :smiley:

The machine is not the problem.

Only anyone serious about publishing… or making any credible statements for that matter.

Anyway, thank you @Sukera for considering my arguments; I don’t think I can make them any clearer to convince you. On my side, I find none of your arguments convincing enough to change my opinion about the limitations of BenchmarkTools when it comes to comparing benchmarks, so, since I don’t like long answers, perhaps it is best if we leave this topic as it is! Thanks again!

Dear @Sukera,
Thanks for taking so much space to reply to my innocent words of encouragement for this discussion :wink:
In principle I agree with almost everything you’ve written. Benchmarking is hard, and drawing conclusions from results is even harder. After contemplating this hopeless situation we also need to acknowledge that most of us need to draw these conclusions (several times?) every day, and BenchmarkTools is precisely the tool to reduce the amount of gut feeling in the decision making ;). I understand that @viraltux suggested precisely this.

Note that the BenchmarkTools readme seems to contradict the limited scope you assigned to it. Let me quote:

BenchmarkTools makes performance tracking of Julia code easy by supplying a framework for writing and running groups of benchmarks as well as comparing benchmark results.

Emphasis is not added :slight_smile:

I agree with you that what we compare, under what assumptions, and what is more advantageous is up to the analyzer. Nonetheless, how we compare (under those assumptions) and what the conclusions are is part of statistics.

The common practice with BenchmarkTools.jl, for 90% of us, is eyeballing the median/minimum/etc. and then making a decision based on gut feeling :wink: The next 4.5% know about the judge function therein. The remaining 5% eyeball the numbers for a long moment and (with an audible sigh and a slight slump of the shoulders)

Then ask

does “faster” even have meaning? Or is it only in the mind of the benchmarker? How could your silicon be so different from mine?

Jokes aside :slight_smile: BenchmarkTools.jl prints some statistics of the collected samples for us to indulge in eyeball-statistics-enhanced decision making. Recently it gained beautiful histograms, and most of us love them. Why? I believe these decrease the amount of “gut” in the decisions we’re making. Again, I see the whole discussion here going exactly in this direction.
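For reference, the ratio comparison that judge performs today looks like this (real BenchmarkTools API; the two benchmarked expressions are arbitrary placeholders):

```julia
using BenchmarkTools

# Two arbitrary workloads; judge compares point estimates, not distributions.
t1 = @benchmark sum(rand(1000))
t2 = @benchmark sum(rand(1010))

# Returns a TrialJudgement classifying the time ratio as
# :invariant, :regression, or :improvement (default tolerance 5%).
judge(median(t1), median(t2))
```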

A more concrete proposal: since BenchmarkTools.judge right now simply compares ratios, we could test the normality of both distributions and then either warn the user of the departure from normality, or perform some hypothesis test on the equality of minimums/means/medians (there are tests for all of these), depending on what the user asked for (remember, this is part of what, not how). Again, don’t ask me for precise procedures to poke holes in, I’m not doing statistics for a living :wink:
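Not a finished design, but a minimal sketch of the “how” on simulated timings (a hand-rolled Welch t statistic using only the Statistics stdlib; a real implementation would presumably lean on HypothesisTests.jl and pick the test to match the user’s chosen estimator):

```julia
using Statistics, Random

# Welch's t statistic for two independent samples; for large samples,
# |t| > ~2 flags a location difference at roughly the 5% level.
welch_t(x, y) = (mean(x) - mean(y)) /
                sqrt(var(x) / length(x) + var(y) / length(y))

Random.seed!(42)
ta = 100.0 .+ randn(500)   # simulated timings for A (mean 100)
tb = 101.0 .+ randn(500)   # simulated timings for B (~1% slower)

abs(welch_t(ta, tb)) > 2   # the 1% difference is clearly detectable
```

With 500 samples per side the statistic comes out around 15 in absolute value, i.e. a 1% location difference that eyeballing two noisy histograms could easily miss.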

I absolutely agree. Now the question is: given this information, do you refuse to take any further action, or do you use it to your advantage? Since we know that our CPUs throttle, have cascades of caches and so on, we could take advantage of this knowledge! (which is what @viraltux tried to convey, or at least this is how I understood him). E.g. to get rid of the throttling effect we could interleave executions of f and g and look at the distribution of paired differences. Warm cache? Add a cache flush in between, or additionally interleave a control function. Or test whether cache flushing in between has a real effect ;).
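The interleaving idea could be sketched like this (a hypothetical helper, not a BenchmarkTools API): by alternating f and g, any slow drift such as thermal throttling or background load hits both functions in the same iteration and largely cancels in the paired difference.

```julia
using Statistics

f() = sum(abs2, rand(10_000))
g() = sum(abs2, rand(100_000))   # roughly 10x the work of f

# Time f and g back to back and keep the per-iteration difference;
# drift common to one iteration cancels in tf - tg.
function paired_differences(f, g, n)
    d = Float64[]
    for _ in 1:n
        tf = @elapsed f()
        tg = @elapsed g()
        push!(d, tf - tg)
    end
    d
end

f(); g()                          # warm up: compile both before timing
d = paired_differences(f, g, 200)
mean(d) < 0                       # f should be faster on average
```

Analyzing `mean(d)` (or its sign) instead of two independent timing distributions is exactly the paired-design trick from classical statistics, and it needs no admin rights.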

Will this solve all of the problems you mentioned? Of course not! Yet I argue that this is the wrong question to ask. The better question would be: would the amount of “gut” in decisions based on benchmarks be higher or lower? I’d argue that it’d be much lower. I think the point you missed is not about being always 100% correct, but about enhancing the tool to make a better decision in a few percent more cases.

No data we collect is truly iid, but good enough is good enough :wink: If you’re happy with making decisions based on the minimum/median/histogram of warm-cache non-iid samples, then surely you’ll be happy if someone tries hard to make the samples a bit more independent. And if you really want to wait 15 minutes for iid samples, please do so. I can’t, I have decisions to make :stuck_out_tongue: