Identical functions repeated benchmarks show systematic differences

When I ran the example above with A() and B() being equal I got very high p-values; I’m not sure about your specific example, though.

I also wonder about @Sukera’s comments on how the functions are loaded; if there were an easy way to reload functions at random memory locations, that would be great.

So BenchmarkTools has judge to compare the results of two benchmarks. See Manual · BenchmarkTools.jl

One challenge is that you want to compare distributions of measurements, and those distributions are often multi-modal and non-Gaussian.
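One hedged possibility (a sketch, not something BenchmarkTools provides out of the box): since each trial stores its raw sample times, you can feed them to a nonparametric test such as HypothesisTests’ Mann-Whitney U test, which avoids any normality assumption, though it still can’t remove machine/OS effects:

using BenchmarkTools, HypothesisTests

# two trials to compare; summing a random vector is just a placeholder workload
m1 = @benchmark sum($(rand(1000)))
m2 = @benchmark sum($(rand(1000)))

# Trial objects keep the individual sample times (in nanoseconds) in the `times` field
pvalue(MannWhitneyUTest(m1.times, m2.times))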

4 Likes

For my specific example it (sometimes) reports a very low p-value.
I generated the data with:

using Base.Order
import Base.sort!

struct HopeSort1Alg <: Base.Algorithm end
const HopeSort1 = HopeSort1Alg()
struct HopeSort2Alg <: Base.Algorithm end
const HopeSort2 = HopeSort2Alg()

function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort1Alg, o::Ordering)
	v
end

function sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort2Alg, o::Ordering)
	v
end

using BenchmarkTools

list = rand(50000)

n = 10
d = []
for i in 1:n
    println(i, d)
    push!(d,
		(@belapsed sort!(x; alg=HopeSort1) setup = (x=copy($list))) -
		(@belapsed sort!(x; alg=HopeSort2) setup = (x=copy($list))))
end

And analyzed as you did:

using HypothesisTests, Statistics, Plots
plot(d,label="A-B")
plot!([0], seriestype = :hline, label= "H0 μ=0")
display(plot!([mean(d)], seriestype = :hline, label = "μA - μB"))

julia> pvalue(OneSampleTTest(mean(d),std(d),n))
3.180291383820311e-5

@vchuravy, judge reports the same false conclusion.

m1 = @benchmark sort!(x; alg=HopeSort1) setup = (x=copy($list));
m2 = @benchmark sort!(x; alg=HopeSort2) setup = (x=copy($list));
judge(minimum(m1), minimum(m2))
BenchmarkTools.TrialJudgement: 
  time:   -19.69% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

Rerunning to be safe:

m1 = @benchmark sort!(x; alg=HopeSort1) setup = (x=copy($list));
m2 = @benchmark sort!(x; alg=HopeSort2) setup = (x=copy($list));
judge(minimum(m1), minimum(m2))
BenchmarkTools.TrialJudgement: 
  time:   -26.58% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

@vchuravy, do you have thoughts on this:

I think I might have found a workaround based on @Sukera’s information and your results. As it turns out, there is a deepcopy function that will “Create a deep copy of x: everything is copied recursively, resulting in a fully independent object.” I thought this might solve our memory allocation problem, and it seems it does. Check out this new version:

using Base.Order
import Base.sort!

struct HopeSort1Alg <: Base.Algorithm end
struct HopeSort2Alg <: Base.Algorithm end
HopeSort1 = HopeSort1Alg()
HopeSort2 = HopeSort2Alg()
sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort1Alg, o::Ordering) = v
sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort2Alg, o::Ordering) = v

using BenchmarkTools

list = rand(50000)

n = 10
d = []
for i in 1:n
    println(i, d)
    push!(d,
		(@belapsed deepcopy(sort!(x; alg=deepcopy(HopeSort1))) setup = (x=deepcopy($list))) -
		(@belapsed deepcopy(sort!(x; alg=deepcopy(HopeSort2))) setup = (x=deepcopy($list))))
end

And if we check results:

using HypothesisTests, Statistics, Plots
plot(d,label="A-B")
plot!([0], seriestype = :hline, label= "H0 μ=0")
plot!([mean(d)], seriestype = :hline, label = "μA - μB")
julia> pvalue(OneSampleTTest(mean(d),std(d),n))
0.3110983331509956


I ran this a couple of times, getting large p-values as expected :slight_smile:

1 Like

@viraltux
I like how you reformatted the MWE, thanks! I’m sorry this issue is so hard to replicate. I ran your code and got a very low p-value. I then ran it around 6 more times, separated by Julia restarts, and got a high p-value every time. Suspecting that something I had done previously in the Julia session of the first run was responsible, I created this slight modification:

using Base.Order
import Base.sort!

struct HopeSort1Alg <: Base.Algorithm end
struct HopeSort2Alg <: Base.Algorithm end
HopeSort1 = HopeSort1Alg()
HopeSort2 = HopeSort2Alg()
sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort1Alg, o::Ordering) = v
sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort2Alg, o::Ordering) = v

using BenchmarkTools

list = rand(50000)

@btime sort!(x; alg=HopeSort1) setup = (x=copy($list)) #NEW
@btime sort!(x; alg=HopeSort2) setup = (x=copy($list))

n = 10
d = []
for i in 1:n
    println(i, d)
    push!(d,
		(@belapsed deepcopy(sort!(x; alg=deepcopy(HopeSort1))) setup = (x=deepcopy($list))) -
		(@belapsed deepcopy(sort!(x; alg=deepcopy(HopeSort2))) setup = (x=deepcopy($list))))
end

using HypothesisTests, Statistics, Plots
plot(d,label="A-B")
plot!([0], seriestype = :hline, label= "H0 μ=0")
display(plot!([mean(d)], seriestype = :hline, label = "μA - μB"))
display(pvalue(OneSampleTTest(mean(d),std(d),n)))

Which yielded p = 8.445338779118404e-7.


Running the modification two more times, restarting Julia in between, produced similar results:
p=6.941367135055767e-6

and
p=1.6727970943169768e-6

I suspect that your solution fails in a “polluted” environment, where one function happens to be advantageously placed in memory, because it does not randomly relocate the function. I tried doing just that by deleting and redefining the methods each iteration, and have yet to find a case where this declares the two functions dissimilar (even in a Julia session where another approach does declare them dissimilar).
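Roughly, the redefinition experiment looked like this (a sketch assuming the HopeSort definitions, list, and n from the MWE above; treat it as illustrative rather than the exact code):

using BenchmarkTools

d = Float64[]
for _ in 1:n
    # redefine both methods every iteration so neither keeps a fixed,
    # possibly advantageous, location in memory (forces recompilation)
    @eval sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort1Alg, o::Ordering) = v
    @eval sort!(v::AbstractVector, lo::Integer, hi::Integer, a::HopeSort2Alg, o::Ordering) = v
    push!(d,
        (@belapsed sort!(x; alg=HopeSort1) setup=(x=copy($list)) evals=1) -
        (@belapsed sort!(x; alg=HopeSort2) setup=(x=copy($list)) evals=1))
end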

I get p-values along the lines of 0.3 to 0.7, with graphs like this

I’m still unsure why your method does work consistently in a fresh Julia session and what the notion of “polluted” actually means.

Note that you want to make sure that you only have one evaluation per sample if you use setup like this, or you will have evaluations where the data is already sorted. For example:

julia> setup_cnt, run_cnt = 0, 0
(0, 0)

julia> @benchmark (global run_cnt += 1) setup=(global setup_cnt+=1)
BenchmarkTools.Trial:
...
  --------------
  samples:          10000
  evals/sample:     998

julia> setup_cnt
11002

julia> run_cnt
10480502

This is mentioned in the manual:

Note that the setup and teardown phases are executed for each sample, not each evaluation. Thus, the sorting example above wouldn’t produce the intended results if evals/sample > 1 (it’d suffer from the same problem of benchmarking against an already sorted vector).

You can set evals=1 to fix this:

julia> @benchmark (global run_cnt += 1) setup=(global setup_cnt+=1) evals=1
...

julia> setup_cnt
10001

julia> run_cnt
10001
3 Likes

That’s a good point; however, in her example there is no actual sorting, it just returns the same values, and that’s why we would all expect to see no systematic bias towards one function versus the other.

I think it comes down to the Machine/OS, for the following reason: last night I tried the deepcopy approach a couple of times and it worked just fine. I left the terminal on overnight, untouched, with no other operations running in Julia. This morning, after reading your message, I ran the code again and the bias appeared again; since nothing was going on in Julia, I imagine it has to be the OS.

Now I am running Julia with privileges in the OS to see if this makes any difference:

sudo nice -n -10 ionice -c 1 -n 1 su -l -c julia [user]

I ran the code a few times (using @kristoffer.carlsson’s tip about evals=1 for good measure) and no bias shows up, which we can now expect in a freshly started Julia. I will nonetheless leave the terminal on for a few hours and see whether the nice|deepcopy combination keeps the bias away over that time frame.

Anyway, at least we have the fresh Julia | deepcopy workaround for the time being. Hopefully the BenchmarkTools team will follow up on your request to properly compare two algorithms.

UPDATE
After about 12 hours running Julia in the same session with the above nice settings, on a Linux 5.8.0-59-generic #66~20.04.1-Ubuntu box running all sorts of software in parallel (including other Julia processes), all benchmarks performed with the deepcopy approach showed no bias, with p-values between 0.05 and 0.9.

1 Like

Having read through this thread again, I’ve noticed that in basically all benchmarks posted here the setup looked something like this:

list = rand(n)
.
.
.
@benchmark [...] setup=(x=copy(list)) # or deepcopy(list)

This will skew the benchmark heavily, since it’s always the same list that is being sorted and the branch predictor in your CPU will learn the patterns in that data.

I don’t think the benchmarks posted here support the conclusions drawn because of that fact alone (not to mention that they also still don’t account for the code being placed in different memory locations, which may influence things as well). Statistics over statistics are meaningless if the statistics you’ve collected are wonky in the first place.
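For what it’s worth, a minimal sketch of a setup that avoids the fixed-input problem (the vector length is arbitrary; the key points are fresh random data for every sample and a single evaluation per sample):

using BenchmarkTools

# fresh random data each sample, and evals=1 so no evaluation ever sees
# an already-sorted (or branch-predictor-warmed) vector
@benchmark sort!(x) setup=(x = rand(50_000)) evals=1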

2 Likes

Can I just check my claim in the post above that the functions are, from the perspective of a CPU executing them, identical, and that therefore any observed differences must be benchmarking artefacts? I actually know very little about these low-level things, but Julia gives people like me a dangerous amount of power to simply look at things like assembly code, so I’m just wondering whether my conclusion makes sense.

1 Like

Both assembly listings you posted contain a callq (which is basically an assembly-level function call), so the comparison is meaningless; both listings just contain the small setup for calling alg. I think the code contains that call because alg is a keyword argument, and because of that the compiler doesn’t even try to inline it. Additionally, code_native always specialises its arguments, though that shouldn’t make too much of a difference here.

That being said, there’s no reason the called code should be materially different; they’re the same function, after all.
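If you want to look at the code that actually runs, one option (a sketch, assuming the HopeSort definitions from earlier in the thread) is to inspect the positional method directly, bypassing the keyword-argument wrapper:

using Base.Order        # for Forward and Ordering
using InteractiveUtils  # for @code_native outside the REPL

v = rand(1000)
# native code of the methods the keyword wrapper ends up calling
@code_native sort!(v, 1, length(v), HopeSort1, Forward)
@code_native sort!(v, 1, length(v), HopeSort2, Forward)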

3 Likes

Identical other than the hard-coded value in the movabsq… but I think it may be more accurate to say “hardware artifacts” than “benchmarking artifacts”: the hardware itself (not to mention the OS) is chock full of mutable state (what is present in all the caches, the TLB, the branch predictors, temperatures, voltages, variable clock speeds, etc.), so some variation here is not surprising to me.

4 Likes

I think we might be missing an important point here, @Sukera: for the problem at hand, how fast the function is is irrelevant; when comparing two identical functions, benchmarks should not be systematically different regardless of the settings in @benchmark. Consider the settings below:

x = @benchmark A() setup=(<setup that makes no sense for checking how fast A is>)
y = @benchmark A() setup=(<setup that makes no sense for checking how fast A is>)

As long as x and y have the same setup, we should not observe any systematic differences between the two, and if the settings could be responsible for this, then we would have two problems: the systematic error introduced by the machine/OS and the one introduced by BenchmarkTools.
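Concretely, a sanity check along those lines could look like this (A is a hypothetical placeholder function; any cheap function would do):

using BenchmarkTools

A() = sum(rand(100))        # hypothetical placeholder workload
x = @benchmark A() evals=1
y = @benchmark A() evals=1
# if there were no systematic bias, this should report "invariant"
# (the point of this thread is that sometimes it does not)
judge(median(x), median(y))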

It does not matter if the CPU is a genius CPU or the smartest CPU on planet CPU when we are systematically comparing the exact same function, and neither does it matter:

As a matter of fact, this is the first answer that I gave in this post (machine/OS… etc.), but then @Lilith went on and said something along the lines of “sure, but this thing is systematic”, and that caught my attention because it should not be.

Using deepcopy in combination with highly privileged nice settings in the machine/OS described above made the systematic error disappear, which means that BenchmarkTools could do better when it comes to comparing two algorithms, instead of just providing independent benchmarks and letting the user figure out which one is faster (and no, the judge function in BenchmarkTools does not fix the issue).

@Lilith I just renamed this Topic from
Identical functions benchmark differently
to
Identical functions repeated benchmarks show systematic differences

I think this might prevent the community from posting quick answers along the lines of “that’s what’s expected because machine/OS…” and instead encourage them to consider the systematic part of the Topic. Feel free to update the Topic name again, though, if you prefer something different.

1 Like

That’s kind of the point here, though: they’re not the “exact same function”. They’re two distinct functions with the exact same syntax, at which point all the arguments about how functions are cached and how the code interacts with the OS come into play.

I agree that the two should not be noticeably different, as long as they both have the same level playing field (i.e., same constraints on inputs and doing real work) and as long as the settings given to @benchmark are identical. This may be a little subtle, but by giving a wonky setup to @benchmark, you’ve accidentally removed the level playing field. At that point, what you’re timing is no longer the inherent speed of the functions (which I’m very certain is still the same), but rather the systemic difference you’re claiming to be an artifact of BenchmarkTools itself (which is not the case; it stems from the setup it’s been given and from failing to account for that).

I agree insofar as we shouldn’t be able to observe them, but that’s only true if we’re in an isolated environment without an OS-level scheduler, with a single-process system on a very dumb machine without branch prediction etc (i.e. a PDP-11). The reality is though that we aren’t, so we have to try to account for those inherent differences by carefully orchestrating our benchmark and using our tools to their fullest extent possible: providing random matrices (to account for differences in branch prediction, load times from different locations in memory, garbage collection…), controlling how often they’re used per evaluation (to account for CPU-level caches, which you often can’t take advantage of in a real-life benchmark) and so on. Sure, BenchmarkTools could try to account for that by itself, but at that point I think there’s too much magic going on to be able to rely on those microbenchmarks at all. There’s already too much handwavy @btime going on as-is, and trying to guess what the user wanted in the first place is going to make it much harder to compare different invocations, not easier. It’s better to be explicit than implicit.

No, this means that, as a user of BenchmarkTools, you have to know what you’re doing to produce reliable benchmarks, and on top of that you have to know what you want to benchmark and how to achieve that. The easy-to-use interface of @benchmark and @btime is a tricky beast, because it lulls us into thinking we can just slap @benchmark A() in our code, check the numbers, conclude something, and thus understand benchmarking, when in fact we don’t (and possibly haven’t accounted for those systemic differences you like to point to).

BenchmarkTools can’t guess what you intended to measure. All it can do is run your code a number of times and give you some statistics about how long it took. As much as I’d like to have automatic interpretation of results (wouldn’t that be something!), it’s still up to the user to know what those numbers actually mean and where they could come from. I personally don’t use judge for precisely that reason, because it can’t tell me on a per-function basis what’s going on with microsecond-to-nanosecond differences, but it’s important to note that judge is not intended for that purpose either. It can only tell me that there is a difference, not where it comes from or why it’s there in the first place. It’s much more useful when you have a large suite of benchmarks and want to check whether a new change leads to a large regression or improvement outside of OS overhead (which is how you’re supposed to use it).
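As a rough illustration of that intended use (names and workloads are made up), judge is typically applied leaf-by-leaf to a whole BenchmarkGroup run before and after a change:

using BenchmarkTools

suite = BenchmarkGroup()
suite["sum"]  = @benchmarkable sum(x)  setup=(x = rand(1000))
suite["sort"] = @benchmarkable sort(x) setup=(x = rand(1000))

tune!(suite)
baseline  = run(suite)   # e.g. results from the old version of the code
candidate = run(suite)   # e.g. results after your change

# elementwise judgement over the whole suite: regression / improvement / invariant
judge(minimum(candidate), minimum(baseline))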


Most of the time, the systemic differences stemming from the OS/machine don’t matter, because they’re the same for all functions everywhere. Statistically, there’s some distribution for how much overhead a given function invocation gets from the OS. By changing nice values (i.e. affecting how the OS schedules your code and giving it a higher preference on your multi-process machine, leading to better cache coherence) you can shift that distribution in your favor, sure, but that’s not representative of a typical workload. After all, if every program could be sped up this way, everything would have maximal nice values all the time, at which point no one would have preferential treatment anymore and you’d be back to square one.

6 Likes

I don’t think anybody is claiming that BenchmarkTools introduces any artifacts, rather that BenchmarkTools does not account for the Machine/OS artifacts when comparing benchmarks.

What users typically want when using BenchmarkTools is to answer these questions:

  1. Is A faster than B?
  2. By how much?
  3. How confident are we in those differences?

None of these questions are directly addressed by BenchmarkTools as of today.

judge cannot tell you whether there is a real difference, and it cannot even tell you in which direction the real difference goes.
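To make that concrete (a sketch with placeholder workloads): ratio and judge only give point estimates and a tolerance check, with no notion of confidence:

using BenchmarkTools

m1 = @benchmark sum($(rand(1000)))
m2 = @benchmark sum($(rand(1000)))

ratio(minimum(m1), minimum(m2))   # question 2: a point estimate of "by how much", nothing more
judge(minimum(m1), minimum(m2))   # question 1-ish: a call against a fixed tolerance
# question 3 (confidence) is left to the user, e.g. via a test on m1.times vs m2.times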

My point in using the deepcopy/nice settings, though, is not to prove that that’s the best way to do things (in fact, I thought of a few other ways to diminish the effect of Machine/OS artifacts that I didn’t use in that example), but rather that yes, we can reliably detect those micro differences (or, in this case, correctly detect that there are none).

Anyway, this conversation was really interesting and I have learnt a lot; I found your link on leaky abstractions especially useful, thank you again @Sukera. However, since this Topic has been resolved as far as the reasons for systematic differences for the same function go, I think it’s best that I end my participation in this thread; perhaps I’ll open a new Topic in the future with very specific examples of what I am trying to convey.

1 Like

I would really appreciate that, thanks in advance!

1 Like

I think it can already do that today (partially with judge), but (as with any benchmark) all of these are going to be limited to “…on this machine at the time of running, with possible effects stemming from the environment”. No benchmarking tool can objectively say “function A is always faster than function B”, because that question is meaningless without its execution context! Even worse, it’s possible that on machine M_1 function A is faster, while on machine M_2 function B is faster, for the exact same code. This can be due to a number of effects, but this alone makes it necessary to take the machine, OS & other environment into account when running the benchmark (and accepting that some sources of noise are infeasible to account for, e.g. location of compiled code in memory, incoming radiation causing bitflips, etc).

That depends on what you mean by “real” here. As we’ve established above, there’s no “real” (or objective) time any given function should take. By its very nature, “benchmarking execution time” is multivariate in (at least) the function in question as well as its execution environment (which you can’t ever eliminate completely, since the function may change with the execution environment due to different CPU capabilities, differences in RAM, etc.).

Worse, some problems may occur even if you’re on an (algorithmically perfect) computer that does what you want it to do, i.e. it runs the best possible code to solve the problem you intended to measure. One example of that is Shouting in the datacenter (WARNING: Loud!), a famous example of “well ok, looks like our code isn’t the problem”. How would BenchmarkTools be able to account for that? We can’t even control our environment to the degree necessary to provide exact carbon copies of each other’s execution environments, so trying to account for them (instead of letting the noise influence the benchmark, accepting that it’s there, and trying our best to make sure we measure what we intend to measure) is, in some sense, futile.

Please do! I always enjoy sharing what I’ve learned over the years about how computers work and I hope I haven’t come across as too lecturing here :slight_smile: I know it’s tempting to think of computers as these deterministic computation machines, but in reality it’s a marvel that they work at all. One last favorite of mine I’d like to share here is Indistinguishable From Magic: Manufacturing Modern Computer Chips, a (by now somewhat dated) talk about what’s (sort of) going on at the transistor level. I hope it’s not too technical, but it gives a good overview of what the landscape at the smallest level looked like ~12 years ago.

4 Likes