Repeated benchmarks of identical functions show systematic differences

I’m trying to write and benchmark sorting algorithms, and I’m having trouble getting consistent benchmark results. I’ve simplified my problem to defining two identical sorting methods using the interface from base/sort.jl (HopeSort1 and HopeSort2), each of which does nothing, yet they consistently benchmark differently. Here’s a MWE (I have yet to replicate this without the sort! API):

using Base.Order
import Base.sort!

struct HopeSort1Alg <: Base.Algorithm end
const HopeSort1 = HopeSort1Alg()
struct HopeSort2Alg <: Base.Algorithm end
const HopeSort2 = HopeSort2Alg()

function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort1Alg, o::Ordering)
	v
end

function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort2Alg, o::Ordering)
	v
end


using BenchmarkTools

list = rand(50000);
println("2: ", @belapsed sort!(x; alg = HopeSort2) setup = (x=copy($list)))
println("1: ", @belapsed sort!(x; alg = HopeSort1) setup = (x=copy($list)))
println("2: ", @belapsed sort!(x; alg = HopeSort2) setup = (x=copy($list)))
println("2: ", @belapsed sort!(x; alg = HopeSort2) setup = (x=copy($list)))
println("1: ", @belapsed sort!(x; alg = HopeSort1) setup = (x=copy($list)))
println("1: ", @belapsed sort!(x; alg = HopeSort1) setup = (x=copy($list)))

Prints:

2: 8.9214e-5
1: 7.4838e-5
2: 9.1911e-5
2: 9.196e-5
1: 7.4876e-5
1: 7.2778e-5

And when I run it again from a fresh Julia session it prints:

2: 7.1904e-5
1: 8.9535e-5
2: 6.5649e-5
2: 6.9556e-5
1: 8.6756e-5
1: 8.6921e-5

And again:

2: 6.9876e-5
1: 8.6794e-5
2: 6.804e-5
2: 6.7724e-5
1: 9.2424e-5
1: 7.9681e-5

It seems to me that within a Julia session one of my implementations is consistently faster than the other. Which is faster and by how much changes session to session.

Any idea why this is?

Hi @Lilith

I think it’s best to use @benchmark instead to check the variability of the results; when we do that, there don’t seem to be any significant differences (e.g. 61.757 μs ± 6.711 μs vs 59.950 μs ± 6.043 μs):

julia> list = rand(50000);

julia> @benchmark sort!(x; alg = HopeSort2) setup = (x=copy($list))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  56.810 μs … 140.647 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     59.792 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   61.757 μs ±   6.711 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▇█▇▆▆▆▆▆▆▅▄▃▂   ▁▂▂▁▁▁▁▁ ▁▁                                 ▂
  ████████████████▇██████████████▇▇▇▆▆▆▆▆▆▇█▇▇▆▆▆▆▄▅▅▅▄▅▅▅▄▄▅▆ █
  56.8 μs       Histogram: log(frequency) by time      92.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x; alg = HopeSort1) setup = (x=copy($list))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  55.849 μs … 258.644 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     58.434 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.950 μs ±   6.043 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂▄▇█▇▆▄▃▃▃▄▃▁        ▂                                      ▂
  ▇██████████████▇▆▅▅▄▆▇███▇██▇▇▆█▆▇▇▄▆▄▅▃▄▃▄▅▄▅▅▅▅▆▅▄▅▃▄▄▄▃▄▅ █
  55.8 μs       Histogram: log(frequency) by time      86.8 μs <

61.757 μs ± 6.711 μs vs 59.950 μs ± 6.043 μs

Pardon my approximate statistics, but I believe that with n = 10000 samples and a sample standard deviation of σ ≈ 6 we get a standard error of σ/√n ≈ 0.06, which puts the sample means about 30 standard errors apart, a statistically significant difference.
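Spelling out that arithmetic as a quick sketch (the summary numbers are just the ones quoted above, and the whole calculation assumes i.i.d. samples):

```julia
# Summary statistics quoted above (times in μs); inputs, not new measurements
n = 10_000                 # samples per benchmark trial
μ1, μ2 = 61.757, 59.950    # the two sample means
σ = 6.0                    # rough sample standard deviation of both trials

se = σ / sqrt(n)                 # standard error of the mean, ≈ 0.06 μs
separation = abs(μ1 - μ2) / se   # ≈ 30 standard errors, if samples were i.i.d.
```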

Practically, I’d like a way to get a measurement with higher accuracy than this bias allows (in your case 3%) by repeating the experiment and taking averages, without having to restart Julia repeatedly.

BenchmarkTools just calls Statistics.std, so the 6.711 μs has already been divided by √n. I would link you to a specific line in the source, but there are a lot of different display methods and I’d rather not point you to the wrong place. But @viraltux’s point that those are within error is correct. An easy way to convince yourself of this, by the way, would be to just run that script again and look at the variability.
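As a side note, one can at least check directly that Statistics.std on raw data returns the Bessel-corrected sample standard deviation; whether the value BenchmarkTools displays has additionally been scaled by √n is a separate question about its display code:

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0]
s = std(x)      # sample std: sum of squared deviations is 5, divided by n - 1 = 3
s ≈ sqrt(5/3)   # true
```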


Sure, they’re pardoned. Those statistics hold, though, under the assumption that the mean we’re estimating is not time dependent, which is not the case here; a computer and its operating system do all sorts of parallel work that changes from one execution to the next. Not only is the mean time dependent, but so is the distribution, so in this context we cannot claim a “statistically significant difference” from those calculations.

I would say it is often fair to assume that algorithms whose estimated means fall well within one standard deviation of a benchmark with a similar distribution are “equivalent”.

Restarting Julia might give us some common starting point as far as Julia is concerned, but Julia is not in control of the OS and all the other processes going around in our computers which means we will still see computer variance in our results.

If we’re not okay with “equivalent” and our algorithms happen to be so close in performance that the computer variability does not allow us to clearly see which one is faster, then there are a few alternatives:

  • Calculate the computational complexity: Time complexity - Wikipedia and prove one is theoretically faster. (This one would solve your simple example in this question)
  • Make statistics about the statistics; for example, studying how statistics like the mean and variance change in your system over time. (There are some modelling challenges here)
  • Run the benchmark for a very long time to make time dependency negligible and then directly compare the median (The easiest).
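A minimal sketch of the last alternative, using hypothetical stand-in functions rather than the sort! methods from the question: interleave the two timings so that slow drift in the machine affects both roughly equally, then compare medians.

```julia
using Statistics

f() = sum(sin, 1:1000)
g() = sum(sin, 1:1000)   # deliberately identical to f

f(); g()   # warm up so compilation time is excluded

tf, tg = Float64[], Float64[]
for _ in 1:500
    # alternating the measurements shares time-dependent noise between them
    push!(tf, @elapsed f())
    push!(tg, @elapsed g())
end

ratio = median(tf) / median(tg)   # should hover near 1.0 for identical code
```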

Anyway, maybe someone can offer more alternatives. Good luck, Lilith.


I think that’s actually the standard deviation; I believe she is talking about the standard error of the estimate of the mean… Here we go, somebody help us! :smile:

I didn’t read that source code very carefully, but I have to imagine that what BenchmarkTools is giving you is the point estimate for the mean and the empirical estimate for the standard deviation of the mean, and I figure the code is organized so that that’s what Statistics.std is doing.

In any case, though, we definitely don’t need to do much investigating to see that those two numbers are not thirty standard deviations apart, because you can just run those few benchmark lines a handful of times at your REPL and get numbers that would be astronomically unlikely if that were true. I just ran the HopeSort2 benchmark a few times and got means that were 3 μs apart from each other.

This is the correct answer. In short, there’s an intrinsic error caused by the computer not being perfectly stable (not to mention your operating system and everything else), so the dominant error in this case is not statistical fluctuation.


This would tell me both algorithms are O(1). In this case I care about the constant factor, hence the benchmarking.

Yes! I think you have pinpointed how I’ve been thinking about this. I’ve been trying this approach, and it is giving me misleading conclusions. Here’s a MWE that benchmarks the same executions as the last one, but explicitly calculates statistics of the statistics over time.

using Base.Order
import Base.sort!

struct HopeSort1Alg <: Base.Algorithm end
const HopeSort1 = HopeSort1Alg()
struct HopeSort2Alg <: Base.Algorithm end
const HopeSort2 = HopeSort2Alg()

function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort1Alg, o::Ordering)
	v
end

function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort2Alg, o::Ordering)
	v
end


using BenchmarkTools
using Statistics: mean, std
using Printf: @printf

list = rand(50000);
times = Float64[], Float64[]
try
	while true
		for f in [length, mean, std]
			@printf "%6s: %9.4g, %9.4g\n" f f(times[1]) f(times[2])
		end

		t,alg = rand() < .5 ? (times[1], HopeSort1) : (times[2], HopeSort2)
		push!(t, @belapsed sort!(x; alg = $alg) setup = (x=copy($list)) seconds=1)
	end
catch e
	# only swallow Ctrl-C; `catch InterruptException` would bind (not match) the exception
	e isa InterruptException || rethrow()
end

display(times[1])
display(times[2])
@printf "%f .. %f\n" minimum(times[1]) maximum(times[1])
@printf "%f .. %f\n" minimum(times[2]) maximum(times[2])

using StatsPlots: violin
violin(
	vcat(fill("HopeSort1",length(times[1])),
		 fill("HopeSort2",length(times[2]))),
	vcat(times...),
	legend=false)

With output

Starting Julia...
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _  |  |
  | | |_| | | | (_| |  |  Version 1.6.1 (2021-04-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
length:         0,         0
  mean:       NaN,       NaN
   std:       NaN,       NaN
length:         0,         1
  mean:       NaN, 0.0002785
   std:       NaN,       NaN
length:         0,         2
  mean:       NaN, 0.0002797
   std:       NaN, 1.764e-06
length:         1,         2
  mean: 0.0003589, 0.0002797
   std:       NaN, 1.764e-06
length:         2,         2
  mean: 0.0003586, 0.0002797
   std: 3.012e-07, 1.764e-06
length:         2,         3
  mean: 0.0003586, 0.0002801
   std: 3.012e-07, 1.395e-06
length:         3,         3
  mean: 0.0003587, 0.0002801
   std: 2.423e-07, 1.395e-06
length:         4,         3
  mean: 0.0003588, 0.0002801
   std: 2.804e-07, 1.395e-06
length:         4,         4
  mean: 0.0003588, 0.0002797
   std: 2.804e-07, 1.356e-06
length:         4,         5
  mean: 0.0003588,   0.00028
   std: 2.804e-07, 1.364e-06
length:         5,         5
  mean: 0.0003588,   0.00028
   std: 2.458e-07, 1.364e-06
length:         5,         6
  mean: 0.0003588, 0.0002799
   std: 2.458e-07, 1.235e-06
length:         6,         6
  mean: 0.0003589, 0.0002799
   std: 2.793e-07, 1.235e-06
length:         6,         7
  mean: 0.0003589, 0.0002801
   std: 2.793e-07, 1.248e-06
length:         7,         7
  mean: 0.0003589, 0.0002801
   std: 2.668e-07, 1.248e-06
length:         7,         8
  mean: 0.0003589,   0.00028
   std: 2.668e-07, 1.245e-06
length:         8,         8
  mean:  0.000359,   0.00028
   std: 2.737e-07, 1.245e-06
length:         8,         9
  mean:  0.000359, 0.0002798
   std: 2.737e-07, 1.273e-06
length:         8,        10
  mean:  0.000359, 0.0002799
   std: 2.737e-07, 1.281e-06
length:         8,        11
  mean:  0.000359,   0.00028
   std: 2.737e-07, 1.246e-06
length:         8,        12
  mean:  0.000359,   0.00028
   std: 2.737e-07, 1.199e-06
length:         8,        13
  mean:  0.000359, 0.0002801
   std: 2.737e-07, 1.203e-06
length:         9,        13
  mean: 0.0003589, 0.0002801
   std: 3.684e-07, 1.203e-06
length:        10,        13
  mean: 0.0003589, 0.0002801
   std: 3.525e-07, 1.203e-06
length:        10,        14
  mean: 0.0003589, 0.0002801
   std: 3.525e-07, 1.159e-06
length:        10,        15
  mean: 0.0003589,   0.00028
   std: 3.525e-07, 1.148e-06
length:        10,        16
  mean: 0.0003589, 0.0002799
   std: 3.525e-07, 1.211e-06
length:        10,        17
  mean: 0.0003589,   0.00028
   std: 3.525e-07, 1.205e-06
length:        10,        18
  mean: 0.0003589, 0.0002801
   std: 3.525e-07, 1.235e-06
length:        11,        18
  mean: 0.0003589, 0.0002801
   std: 3.345e-07, 1.235e-06
length:        12,        18
  mean: 0.0003589, 0.0002801
   std:  3.27e-07, 1.235e-06
length:        13,        18
  mean: 0.0003588, 0.0002801
   std: 4.554e-07, 1.235e-06
length:        14,        18
  mean: 0.0003589, 0.0002801
   std: 4.497e-07, 1.235e-06
^C14-element Vector{Float64}:
 0.000358857
 0.000358431
 0.000358844
 0.000359108
 0.000358895
 0.000359249
 0.000359105
 0.000359261
 0.000358174
 0.00035907
 0.000358872
 0.000359146
 0.000357725
 0.000359215
18-element Vector{Float64}:
 0.000278462
 0.000280956
 0.000280791
 0.000278598
 0.000281255
 0.00027954
 0.000281347
 0.000278821
 0.000278431
 0.000281212
 0.000280861
 0.000279468
 0.000281279
 0.000280378
 0.000279061
 0.000278087
 0.000281058
 0.000281661
0.000358 .. 0.000359
0.000278 .. 0.000282

These results are consistent over time within a Julia session (as you see here), but not between sessions.

I did that, and was shocked to discover the problem persists. Here is a MWE (which produces different results in different Julia sessions):

using Base.Order
import Base.sort!

struct HopeRollSort1Alg <: Base.Algorithm end
const HopeRollSort1 = HopeRollSort1Alg()
struct HopeRollSort2Alg <: Base.Algorithm end
const HopeRollSort2 = HopeRollSort2Alg()

function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeRollSort1Alg, o::Ordering)
	if hi > lo
		first = v[1]
		for i in lo:(hi-1)
			v[i] = v[i + 1]
		end
		v[hi] = first
	end
end

function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeRollSort2Alg, o::Ordering)
	if hi > lo
		first = v[1]
		for i in lo:(hi-1)
			v[i] = v[i + 1]
		end
		v[hi] = first
	end
end


using BenchmarkTools

list = rand(50000);
function perf_sort(v; alg)
	for i in 1:50000
		sort!(v; alg=alg)
	end
end

@time perf_sort(copy(list), alg = HopeRollSort1)
@time perf_sort(copy(list), alg = HopeRollSort2)
display(@benchmark perf_sort(x, alg = HopeRollSort1) setup = (x=copy($list)) seconds = 30)
display(@benchmark perf_sort(x, alg = HopeRollSort2) setup = (x=copy($list)) seconds = 30)
display(@benchmark perf_sort(x, alg = HopeRollSort1) setup = (x=copy($list)) seconds = 30)
display(@benchmark perf_sort(x, alg = HopeRollSort2) setup = (x=copy($list)) seconds = 30)

and here are the results

Press Enter to start a new session.
Starting Julia...
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _  |  |
  | | |_| | | | (_| |  |  Version 1.6.1 (2021-04-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
  6.286187 seconds (69.58 k allocations: 4.615 MiB, 0.77% compilation time)
  6.513327 seconds (58.38 k allocations: 3.946 MiB, 1.02% compilation time)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  5.430 s …    5.812 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.494 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.532 s ± 140.449 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  █   █   ██                                            ██ 
  █▁▁█▁▁▁█▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  5.43 s         Histogram: frequency by time         5.81 s <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  6.197 s …    6.498 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.242 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.289 s ± 123.990 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██      █           █                                    ██ 
  ██▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  6.2 s          Histogram: frequency by time          6.5 s <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  5.448 s …   5.482 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.470 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.466 s ± 15.227 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                          ▁                ▁         ▁ ▁▁ 
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁█ ▁
  5.45 s         Histogram: frequency by time        5.48 s <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  6.144 s …    6.614 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.184 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.274 s ± 193.967 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   ██       █                                           ██ 
  █▁▁▁██▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  6.14 s         Histogram: frequency by time         6.61 s <

 Memory estimate: 0 bytes, allocs estimate: 0.

Which seems to indicate that HopeRollSort1 is 12%–15% faster than HopeRollSort2, and that they are certainly not the same.

I’m beginning to suspect something along the lines of: when these two semantically identical functions are compiled, one of them somehow gets advantaged in a measurable way.

Could you just look at the generated machine code to check that suspicion?

Looks identical to me, but I’ve had a long day!

julia> @code_native perf_sort(copy(list), alg = HopeRollSort1)
	.text
; ┌ @ REPL[22]:1 within `perf_sort##kw'
	pushq	%rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	subq	$40, %rsp
	movq	%rsi, 32(%rsp)
	movq	16(%rsi), %r12
	movl	$50000, %ebx                    # imm = 0xC350
	movabsq	$jl_system_image_data, %rbp
	movabsq	$"fpsort!", %r13
	movabsq	$jl_system_image_data, %r14
	leaq	8(%rsp), %r15
	nop
; │ @ REPL[22]:2 within `perf_sort##kw'
; │┌ @ REPL[22]:3 within `#perf_sort#1'
; ││┌ @ sort.jl:705 within `sort!##kw'
; │││┌ @ sort.jl:717 within `#sort!#8'
; ││││┌ @ sort.jl:1189 within `sort!'
L64:
	movq	%r12, 8(%rsp)
	movabsq	$140153198412232, %rax          # imm = 0x7F77F59B01C8
	movq	%rax, 16(%rsp)
	movq	%rbp, 24(%rsp)
	movq	%r14, %rdi
	movq	%r15, %rsi
	movl	$3, %edx
	callq	*%r13
; ││└└└
; ││┌ @ range.jl:674 within `iterate'
; │││┌ @ promotion.jl:410 within `=='
	decq	%rbx
; ││└└
	jne	L64
; │└
; │ @ REPL[22] within `perf_sort##kw'
	movabsq	$jl_system_image_data, %rax
; │ @ REPL[22]:2 within `perf_sort##kw'
	addq	$40, %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
; └

julia> @code_native perf_sort(copy(list), alg = HopeRollSort2)
	.text
; ┌ @ REPL[22]:1 within `perf_sort##kw'
	pushq	%rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	subq	$40, %rsp
	movq	%rsi, 32(%rsp)
	movq	16(%rsi), %r12
	movl	$50000, %ebx                    # imm = 0xC350
	movabsq	$jl_system_image_data, %rbp
	movabsq	$"fpsort!", %r13
	movabsq	$jl_system_image_data, %r14
	leaq	8(%rsp), %r15
	nop
; │ @ REPL[22]:2 within `perf_sort##kw'
; │┌ @ REPL[22]:3 within `#perf_sort#1'
; ││┌ @ sort.jl:705 within `sort!##kw'
; │││┌ @ sort.jl:717 within `#sort!#8'
; ││││┌ @ sort.jl:1189 within `sort!'
L64:
	movq	%r12, 8(%rsp)
	movabsq	$140153198412320, %rax          # imm = 0x7F77F59B0220
	movq	%rax, 16(%rsp)
	movq	%rbp, 24(%rsp)
	movq	%r14, %rdi
	movq	%r15, %rsi
	movl	$3, %edx
	callq	*%r13
; ││└└└
; ││┌ @ range.jl:674 within `iterate'
; │││┌ @ promotion.jl:410 within `=='
	decq	%rbx
; ││└└
	jne	L64
; │└
; │ @ REPL[22] within `perf_sort##kw'
	movabsq	$jl_system_image_data, %rax
; │ @ REPL[22]:2 within `perf_sort##kw'
	addq	$40, %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
; └

Because your CPU is warmer after running 1?


I ran 1, 2, 1, 2, and both 2s took longer than both 1s.

This doesn’t have anything to do with one function being advantaged over the other; it’s probably the abstractions of the technologies your computer uses leaking up the stack.

When code is compiled, it’s placed somewhere in memory. Depending on where it’s placed, it can be accessed faster or slower (e.g. because it’s placed closer to where the benchmarking code resides). This isn’t something that’s exposed through julia and hence the abstraction of the julia runtime/context of execution is leaking that low level detail of how code is placed in and loaded from memory (and maybe looked up during benchmarking as well). I’m almost certain if you place HopeRollSort1Alg after HopeRollSort2Alg, you could sometimes observe that HopeRollSort2Alg is faster.

At the moment, julia does not try to reorder compiled code based on how often it’s called, as far as I know.

If you know where and how to look, you can find these kinds of leaky abstractions all over the place, and as Joel Spolsky put it:

All non-trivial abstractions, to some degree, are leaky.


Boy, is this turning into an interesting thread. I cannot reproduce the performance discrepancy you are observing. I’m on an Intel i5-11600K on an official Linux build, and my timings for the first two @benchmark runs were

 Time  (mean ± σ):   3.163 s ± 44.875 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
[...]
 Time  (mean ± σ):   3.181 s ± 69.147 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

which, again, assuming BenchmarkTools is doing the sane thing and giving you the estimated standard deviation of the estimator of the mean, are very much equal within error. I would definitely conclude that those have the same performance on my machine.

@jling also gets at this and I acknowledge your response, but just out of curiosity: is it possible that you are on a laptop with stressed thermals? Like, are you experiencing periodic throttling or something?


I’m on a Dual-Core Intel Core i5 on macOS 11.4 (20F71)

I don’t see how periodic throttling could explain the results from my second MWE. It repeatedly selects which function to benchmark at random—recording the measurement and which function was measured. The results I got (included with the MWE in my post above) were an accumulated 14 measurements of one function in the range 0.000358 … 0.000359 and 18 measurements of the other in the range 0.000278 … 0.000282. Given the random sample order I don’t think it’s possible this is a result of periodic throttling unless the choice of function triggers the throttling.

I think your question nonetheless raises a very good point. Most of the time, the reason we use BenchmarkTools is not that we want to know how fast A is, but rather whether A is faster than B and by how much. A very good addition to BenchmarkTools, in my opinion, would be a macro to compare A vs B vs … X, instead of us guessing whether one is faster than the others based on their statistics. This macro would also allow for internal bias reduction (reloading A and B and so on), and running such a macro for a long time should also account for the whole machine/OS potential bias.

For now, going to the statistics on statistics idea we can run for a long time things like:

A() = for i in 1:1000 sin(i) end
B() = for i in 1:999 sin(i) end

using BenchmarkTools

n = 10
d = []
for i in 1:n
    println(i)
    push!(d,(@belapsed A()) - (@belapsed B()))
end

And then analyze results in ways like:

using HypothesisTests, Statistics, Plots
plot(d,label="A-B")
plot!([0], seriestype = :hline, label= "H0 μ=0")
plot!([mean(d)], seriestype = :hline, label = "μA - μB")

julia> pvalue(OneSampleTTest(mean(d),std(d),n))
0.004007649672446793

[plot: the per-run A - B differences, with horizontal reference lines at H0 μ = 0 and at μA - μB]

Also, since you’re working on this and you raised this problem, if I may, I would suggest you open an issue with the BenchmarkTools team. I’d support the idea of a macro like @benchmark A B …


FWIW, I don’t think it’s necessarily representative if you only have 14–18 measurements. The benchmarks posted by @viraltux earlier had 10,000 measurements, allowing for a much more representative sample on their machine.


Good idea! Feature Request: `@benchmark f() g()` · Issue #239 · JuliaCI/BenchmarkTools.jl · GitHub

I’m worried that this would also report a very low p value for my two identical functions for the reason @Sukera mentioned.

I must not have been clear. Each of my 14 + 18 trials consists of 10_000 measurements. Before posting, I also ran the same program multiple times. The longest run accumulated a total of 84 trials of 10_000 measurements each and spanned several minutes of wall time. The pattern I reported (the fastest runtime of one is slower than the slowest runtime of the other) was also present in that run.


When I ran the example above with A() and B() being equal, I got very high p-values; I’m not sure about your specific example, though.

I also wonder about @Sukera’s comments on how the functions are loaded; if there were an easy way to reload functions at random locations in memory, that would be great.

So BenchmarkTools has judge to compare the results of two benchmarks. See Manual · BenchmarkTools.jl
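For example, a sketch with placeholder workloads (only the judge call itself is the point):

```julia
using BenchmarkTools, Statistics

# Two throwaway benchmarks standing in for the real candidates
t1 = run(@benchmarkable sum(sin, 1:1000); seconds=1)
t2 = run(@benchmarkable sum(sin, 1:1000); seconds=1)

# Compare the median estimates; classifies the pair as
# improvement / invariant / regression under a default tolerance
j = judge(median(t1), median(t2))
```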

One challenge is that you want to compare distributions of measurements, and they are often multi-modal and non-Gaussian.

4 Likes