I’m trying to write and benchmark sorting algorithms, and am having some trouble getting consistent benchmark results. I’ve simplified my problem to defining two identical sorting methods using the interface from base/sort.jl—HopeSort1 and HopeSort2—each of which do nothing, but consistently benchmark differently. Here’s a MWE (I have yet to replicate this without the sort! API):

using Base.Order
import Base.sort!
struct HopeSort1Alg <: Base.Algorithm end
const HopeSort1 = HopeSort1Alg()
struct HopeSort2Alg <: Base.Algorithm end
const HopeSort2 = HopeSort2Alg()
function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort1Alg, o::Ordering)
v
end
function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort2Alg, o::Ordering)
v
end
using BenchmarkTools
list = rand(50000);
println("2: ", @belapsed sort!(x; alg = HopeSort2) setup = (x=copy($list)))
println("1: ", @belapsed sort!(x; alg = HopeSort1) setup = (x=copy($list)))
println("2: ", @belapsed sort!(x; alg = HopeSort2) setup = (x=copy($list)))
println("2: ", @belapsed sort!(x; alg = HopeSort2) setup = (x=copy($list)))
println("1: ", @belapsed sort!(x; alg = HopeSort1) setup = (x=copy($list)))
println("1: ", @belapsed sort!(x; alg = HopeSort1) setup = (x=copy($list)))

It seems to me that within a Julia session one of my implementations is consistently faster than the other. Which is faster and by how much changes session to session.

I think it’s best to use @benchmark instead to check on the variability of results, when we do that there seems not to be any significant differences (e.g. 61.757 μs ± 6.711 μsvs59.950 μs ± 6.043 μs)

Pardon my approximate statistics, but I believe that with n=10000 samples, and a sample standard deviation of σ ≈ 6 we get a standard error of σ/sqrt(n) ≈ .06, which puts the sample means about 30 standard errors apart—a statistically significant difference.

Practically, I’d like a way to get a measurement with a higher accuracy than allowed by this bias (in your case 3%) by repeating the experiment and taking averages without having to restart Julia repeatedly.

BenchmarkTools just calls Statistics.std, so the 6.771 \mu s already has been divided by \sqrt{n}. I would link you to a specific line in the source, but there are a lot of different display methods and I’d rather not point you to the wrong place. But @viraltux’s point that those are within error is correct. An easy way to convince yourself about this, by the way, would be to just run that script again (and see the variability).

Sure they’re pardoned. Those statistics work though under the assumption that the mean we’re estimating is not time dependent which is not the case; a computer and its operating system is under all sort of parallel work that changes one execution to the next. Not only the mean statistic is time dependent but so is the distribution, in this context we cannot claim a “statistically significant difference” with those calculations.

I would say that often it is fair to assume that algorithms which estimated means fall well within one standard deviation of a benchmark having similar distributions are “equivalent”.

Restarting Julia might give us some common starting point as far as Julia is concerned, but Julia is not in control of the OS and all the other processes going around in our computers which means we will still see computer variance in our results.

If we’re not okay with “equivalent” and our algorithms happen to be so close in performance that the computer variability does not allow us to clearly see which one is faster, then there are a few alternatives:

Calculate the computational complexity: Time complexity - Wikipedia and prove one is theoretically faster. (This one would solve your simple example in this question)

Make statistics about the statistics; for example, studying how statistics like the mean and variance change in your system over time. (There are some modelling challenges here)

Run the benchmark for a very long time to make time dependency negligible and then directly compare the median (The easiest).

Anyway, maybe someone may offer more alternatives. good luck Lilith.

I think actually that’s the standard deviation, I believe she is talking about the standard error of the estimation for the mean… Here we go, somebody help us!

I didn’t read that source code very carefully, but I have to imagine that what BenchmarkTools is giving you is the point estimate for the mean and the empirical estimate for the standard deviation of the mean, and I figure the code is organized so that that’s what Statistics.std is doing.

In any case, though, we definitely don’t need to do much investigating to see that those two numbers are not thirty standard deviations apart. Because you can just run those few benchmark lines a handful of times at your REPL and get numbers that would be astronomically unlikely if that were true. I just ran the HopeSort2 benchmark a few times and got means that were 3 \mu s apart from each other.

this is the correct answer. In short just think there’s an intrinsic error caused by computer isn’t perfectly stable (not to mention your operating system and everything), so the dominant error in this case is not statistical fluctuation

This would tell me both algorithms are O(1). in this case I care about the constant factor—hence benchmarking.

Yes! I think you have pinpointed how I’ve been thinking about this. I’ve been trying this approach, and it is giving me misleading conclusions. Here’s a MWE that benchmarks the same executions as the last one, but explicitly calculates statistics of the statistics over time.

using Base.Order
import Base.sort!
struct HopeSort1Alg <: Base.Algorithm end
const HopeSort1 = HopeSort1Alg()
struct HopeSort2Alg <: Base.Algorithm end
const HopeSort2 = HopeSort2Alg()
function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort1Alg, o::Ordering)
v
end
function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort2Alg, o::Ordering)
v
end
using BenchmarkTools
using Statistics: std
using Printf: @printf
list = rand(50000);
times = Float64[], Float64[]
try
while true
for f in [length, mean, std]
@printf "%6s: %9.4g, %9.4g\n" f f(times[1]) f(times[2])
end
t,alg = rand() < .5 ? (times[1], HopeSort1) : (times[2], HopeSort2)
push!(t, @belapsed sort!(x; alg = $alg) setup = (x=copy($list)) seconds=1)
end
catch InterruptException end
display(times[1])
display(times[2])
@printf "%f .. %f\n" minimum(times[1]) maximum(times[1])
@printf "%f .. %f\n" minimum(times[2]) maximum(times[2])
using StatsPlots: violin
violin(
vcat(fill("HopeSort1",length(times[1])),
fill("HopeSort2",length(times[2]))),
vcat(times...),
legend=false)

These results are consistent over time within a Julia session (as you see here), but not between sessions.

I did that, and was shocked to discover the problem persists. Here is a MWE (which produces different results in different Julia sessions):

using Base.Order
import Base.sort!
struct HopeRollSort1Alg <: Base.Algorithm end
const HopeRollSort1 = HopeRollSort1Alg()
struct HopeRollSort2Alg <: Base.Algorithm end
const HopeRollSort2 = HopeRollSort2Alg()
function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeRollSort1Alg, o::Ordering)
if hi > lo
first = v[1]
for i in lo:(hi-1)
v[i] = v[i + 1]
end
v[hi] = first
end
end
function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeRollSort2Alg, o::Ordering)
if hi > lo
first = v[1]
for i in lo:(hi-1)
v[i] = v[i + 1]
end
v[hi] = first
end
end
using BenchmarkTools
list = rand(50000);
function perf_sort(v; alg)
for i in 1:50000
sort!(v; alg=alg)
end
end
@time perf_sort(copy(list), alg = HopeRollSort1)
@time perf_sort(copy(list), alg = HopeRollSort2)
display(@benchmark perf_sort(x, alg = HopeRollSort1) setup = (x=copy($list)) seconds = 30)
display(@benchmark perf_sort(x, alg = HopeRollSort2) setup = (x=copy($list)) seconds = 30)
display(@benchmark perf_sort(x, alg = HopeRollSort1) setup = (x=copy($list)) seconds = 30)
display(@benchmark perf_sort(x, alg = HopeRollSort2) setup = (x=copy($list)) seconds = 30)

and here are the results

Press Enter to start a new session.
Starting Julia...
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _ | |
| | |_| | | | (_| | | Version 1.6.1 (2021-04-23)
_/ |\__'_|_|_|\__'_| | Official https://julialang.org/ release
|__/ |
6.286187 seconds (69.58 k allocations: 4.615 MiB, 0.77% compilation
6.513327 seconds (58.38 k allocations: 3.946 MiB, 1.02% compilation time)
BechmarkTools.Trial: 6 samples with 1 evaluations.
Range (min … max): 5.430 s … 5.812 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.494 s ┊ GC (median): 0.00%
Time (mean ± σ): 5.532 s ± 140.449 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █ ██ ██
█▁▁█▁▁▁█▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
5.43 s Histogram: frequency by time 5.81 s <
Memory estimate: 0 bytes, allocs estimate: 0.
BechmarkTools.Trial: 5 samples with 1 evaluations.
Range (min … max): 6.197 s … 6.498 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.242 s ┊ GC (median): 0.00%
Time (mean ± σ): 6.289 s ± 123.990 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
██ █ █ ██
██▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
6.2 s Histogram: frequency by time 6.5 s <
Memory estimate: 0 bytes, allocs estimate: 0.
BechmarkTools.Trial: 6 samples with 1 evaluations.
Range (min … max): 5.448 s … 5.482 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.470 s ┊ GC (median): 0.00%
Time (mean ± σ): 5.466 s ± 15.227 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁ ▁ ▁ ▁▁
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁█ ▁
5.45 s Histogram: frequency by time 5.48 s <
Memory estimate: 0 bytes, allocs estimate: 0.
BechmarkTools.Trial: 5 samples with 1 evaluations.
Range (min … max): 6.144 s … 6.614 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.184 s ┊ GC (median): 0.00%
Time (mean ± σ): 6.274 s ± 193.967 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ██ █ ██
█▁▁▁██▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
6.14 s Histogram: frequency by time 6.61 s <
Memory estimate: 0 bytes, allocs estimate: 0.

Which seem to indicate that HopeRollSort1 is 12%-15% faster than HopeRollSort2, and that that they are certainly not the same.

I’m beginning to suspect something along the lines of when these two semantically identical functions are compiled, one of them gets advantaged somehow in a measurable way.

This doesn’t have anything to do with one function being advantaged over another, it’s probably the abstraction(s) of technologies your computer uses leaking up the stack.

When code is compiled, it’s placed somewhere in memory. Depending on where it’s placed, it can be accessed faster or slower (e.g. because it’s placed closer to where the benchmarking code resides). This isn’t something that’s exposed through julia and hence the abstraction of the julia runtime/context of execution is leaking that low level detail of how code is placed in and loaded from memory (and maybe looked up during benchmarking as well). I’m almost certain if you place HopeRollSort1AlgafterHopeRollSort2Alg, you could sometimes observe that HopeRollSort2Alg is faster.

At the moment, julia does not try to reorder compiled code based on how often it’s called, as far as I know.

If you know where and how to look, you can find these kinds of leaky abstractions all over the place - and how Joel Spolsky put it:

All non-trivial abstractions, to some degree, are leaky.

Boy is this turning into an interesting thread. I cannot reproduce the performance discrepancy you are observing. I’m on an intel i5-11600K on an official linux build, and my timings for the first two @benchmarks were

Time (mean ± σ): 3.163 s ± 44.875 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
[...]
Time (mean ± σ): 3.181 s ± 69.147 ms ┊ GC (mean ± σ): 0.00% ± 0.00%

which, again, assuming the BenchmarkTools is doing the sane thing and giving you the estimated standard deviation for the estimator of the mean, are very much equal within error. I would definitely conclude that those have the same performance on my machine.

@jling also gets at this and I acknowledge your response, but just out of curiosity, is it possible that you are on a laptop with stressed thermals or something? Like, are you experiencing periodic throttling or something?

I’m on a Dual-Core Intel Core i5 on macOS 11.4 (20F71)

I don’t see how periodic throttling could explain the results from my second MWE. It repeatedly selects which function to benchmark at random—recording the measurement and which function was measured. The results I got (included with the MWE in my post above) were an accumulated 14 measurements of one function in the range 0.000358 … 0.000359 and 18 measurements of the other in the range 0.000278 … 0.000282. Given the random sample order I don’t think it’s possible this is a result of periodic throttling unless the choice of function triggers the throttling.

I think nonetheless your question raises a very good point. Most of the time the reason we use BenchmarkTools is not because we want to know how fast is A but rather if A is faster than B and by how much. A very good addition in my opinion to BenchmarkTools would be a macro to compare A vs B vs… X instead us guessing if one is faster than the others based on their statistics. This macro would also allow for internal bias reduction (reloading A and B and…, etc.) and running such macro for a long time should account as well for the whole machine/OS potential bias.

For now, going to the statistics on statistics idea we can run for a long time things like:

A() = for i in 1:1000 sin(i) end
B() = for i in 1:999 sin(i) end
using BenchmarkTools
n = 10
d = []
for i in 1:n
println(i)
push!(d,(@belapsed A()) - (@belapsed B()))
end

Also, since you’re working on this and you raised this problem, if I may I would suggest for you to open an issue to the BenchmarkTools team. I’d support the idea of a macro like @benchmark A B …

FWIW, I don’t think it’s necessarily representative if you only have 14-18 measurements. The benchmarks posted by @viraltux earlier had 10.000 measurements, allowing for a much more representative sample on their machine.

I’m worried that this would also report a very low p value for my two identical functions for the reason @Sukeramentioned.

I must not have been clear. Each of my 14+18 trials consists of 10_000 measurements. Before posting, I also ran the same program multiple times. In the longest run got a total of 84 trials consisting of 10_000 measurements each and spanned several minutes of wall time. The pattern I reported (the fastest runtime of one is slower than the slowest runtime of the other) was also present in that run.