It seems to me that we should have the technology to determine if a change in the performance of a function is statistically significant, possibly with some user specified assumptions.

Has anyone made a workflow or little package for this?

4 Likes

Note that benchmarking data is non-iid and non-normal.

So something like a t-test would be inappropriate.

So you likely want two-sample Kolmogorov-Smirnov or Anderson-Darling, but I would need to double check if either have normality assumptions.
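For the record, neither test assumes normality: the two-sample Kolmogorov-Smirnov and k-sample Anderson-Darling tests are both distribution-free (though they do assume i.i.d. samples, which benchmark timings violate). A quick sketch of what the comparison would look like, here in Python with `scipy` and synthetic timing data:

```python
# Sketch: comparing two benchmark timing samples with distribution-free
# tests. The timing data below is synthetic (right-skewed, like typical
# benchmark runs); only the scipy calls are the point.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
old = rng.lognormal(mean=0.0, sigma=0.1, size=1000)   # baseline run times
new = rng.lognormal(mean=0.05, sigma=0.1, size=1000)  # ~5% slower

ks = stats.ks_2samp(old, new)          # two-sample Kolmogorov-Smirnov
ad = stats.anderson_ksamp([old, new])  # k-sample Anderson-Darling (k=2)

print(f"KS statistic={ks.statistic:.3f}, p={ks.pvalue:.3g}")
print(f"AD statistic={ad.statistic:.3f}")
```

Note that both tests compare the whole distributions, so with large samples they will flag shifts in the tails even when the medians barely move.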

4 Likes

Permutation test should be most robust statistically. Also, to mitigate the effect of always-varying computer load, running the two functions in an interleaved way would probably help. Like, 1000x run `f()`, 1000x run `g()`, repeat these 100x.
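A minimal sketch of such a permutation test (in Python, on synthetic timings; `n_perm = 10_000` is an arbitrary choice): under the null that `f` and `g` have the same timing distribution, the group labels are exchangeable, so we compare the observed difference in means against the differences obtained from random relabelings.

```python
# Sketch of a two-sample permutation test on mean run time.
# All timing data is synthetic; n_perm is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(0)
f_times = rng.lognormal(0.0, 0.1, size=500)
g_times = rng.lognormal(0.05, 0.1, size=500)  # ~5% slower

observed = g_times.mean() - f_times.mean()
pooled = np.concatenate([f_times, g_times])

n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)          # random relabeling
    diff = perm[500:].mean() - perm[:500].mean()
    if abs(diff) >= abs(observed):
        count += 1

# Add-one correction so the p-value is never exactly zero.
p_value = (count + 1) / (n_perm + 1)
print(f"observed diff = {observed:.4f}, p = {p_value:.4f}")
```

The interleaved execution order suggested above matters for data collection, not for the test itself: it makes the "same machine load" assumption behind exchangeability more plausible.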

3 Likes

Is the full distribution of run times kept or just the histogram? If you have all times, might as well just use it as a “bootstrap” distribution. No need to rely on distributional assumptions if you have the distribution.
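For concreteness, one common form of this is a percentile-bootstrap confidence interval for the difference in median run time, resampling the raw timings directly. A Python sketch with synthetic data (the 95% level and 10,000 resamples are arbitrary choices):

```python
# Sketch: percentile-bootstrap CI for the difference in median run time.
# No distributional assumptions; we resample the raw timings themselves.
# Data is synthetic; n_boot and the 95% level are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
old = rng.lognormal(0.0, 0.1, size=1000)
new = rng.lognormal(0.05, 0.1, size=1000)  # ~5% slower

n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    old_rs = rng.choice(old, size=old.size, replace=True)
    new_rs = rng.choice(new, size=new.size, replace=True)
    diffs[i] = np.median(new_rs) - np.median(old_rs)

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for median difference: [{lo:.4f}, {hi:.4f}]")
# If the interval excludes 0, the change is significant at roughly 5%.
```

The median is a natural summary here since benchmark timings are right-skewed, but the same resampling works for the mean or any quantile.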

We have all data points available. Do you have a reference for a bootstrap method?

The problem is described as future work in [1608.04295] Robust benchmarking in noisy environments IIRC.

Having a more reliable comparison between two benchmark runs would be very powerful.

How many repetitions do you normally have of a given benchmark? The documentation states:

`samples`: The number of samples to take. Execution will end if this many samples have been collected. Defaults to `BenchmarkTools.DEFAULT_PARAMETERS.samples = 10000`.

If that is a typical sample size, one concern is that the test would be sensitive to even very small discrepancies of no practical importance. Another thing to consider is the number of flagged benchmarks: in a large suite, you may get a large number of spurious flags (approximately 5% under most assumptions). I wonder if an effect size statistic might be a better way to evaluate differences?

Also related: Making @benchmark outputs statistically meaningful, and actionable

And the corresponding issue on GitHub: use legitimate non-iid hypothesis testing · Issue #74 · JuliaCI/BenchmarkTools.jl · GitHub

Don’t mind me, I’m just doing some sweet sweet cross-referencing

3 Likes