Making @benchmark outputs statistically meaningful, and actionable

OK, it’s good you’re not offended; I had inferred that from your tone.

And I did notice that you said it “would be a lot of effort” to make an MWE. But you’re in effect asking everyone else to do something similar individually to help you.

Collectively that’s not DRY, so you can probably guess how programmers will feel about it.

5 Likes

Also relevant to this discussion (in case anybody wants to weigh in):

1 Like

I’m not asking that at all. In fact, you’re completely missing the point of my post. I posted to share some knowledge, not to get help. I’m asking readers simply to take on board something that Emery Berger, a Professor of Computer Science, explains very well in this talk: memory layout makes a difference, and it needs to be accounted for when running benchmarks. If you haven’t watched it and don’t know what I’m talking about, please do watch it, for your own benefit.

You’re welcome.

Incidentally, 127 people have read this post, 7 of you have responded, but only 3 have even clicked the link. I find that mind-blowing.

What you may be missing is that we need to be able to see how your problem relates to the video. It’s 42 minutes! Who has time for that? We all maintain a million things and have five of these threads going.

You really just need to invest a little more time to convince us to invest some of our time.

Very often people post things here that are mistaken; some evidence in MWE form helps us clarify whether, in this case, you are in fact on to something.

11 Likes

One potential issue is that BenchmarkTools measures wall-clock time and not CPU time or cycles, so you get to experience OS scheduler variability and your processor down-clocking.

I had a PR open to also measure CPU time, but Windows made that an unpleasant experience: Measure cpu-time and real-time and report both by vchuravy · Pull Request #94 · JuliaCI/BenchmarkTools.jl · GitHub
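To illustrate the wall-clock vs CPU-time distinction, here is a rough sketch (my own illustration, not what the PR does) that reads process CPU time via libc’s clock(); it assumes a POSIX libc where CLOCKS_PER_SEC is 1,000,000, which is exactly the kind of assumption that falls apart on Windows.

const CLOCKS_PER_SEC = 1_000_000  # POSIX value; an assumption, not portable

# Process CPU time in seconds via libc's clock()
cpu_seconds() = (@ccall clock()::Clong) / CLOCKS_PER_SEC

# Time a workload both ways and return the two durations in seconds
function wall_vs_cpu(f)
    w0 = time_ns()
    c0 = cpu_seconds()
    f()
    (wall = (time_ns() - w0) / 1e9, cpu = cpu_seconds() - c0)
end

# A sleep burns wall-clock time but almost no CPU time, so the two diverge:
wall_vs_cpu(() -> Libc.systemsleep(1))   # roughly (wall = 1.0, cpu = 0.0)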

6 Likes

My own personal view on posting: If you’re unwilling to accept the premise of a question or post, simply don’t respond. It wasn’t meant for you.

1 Like

I agree that a @bcompare macro would be really useful. Effects of this kind have wasted enough of my time that I think interleaving the workloads (preferably randomly) is the correct thing to do by default. There are a lot of things that can cause these effects, some of them are fixable and some aren’t, but either way it’s extra mental overhead and unnecessary potential for error.

I think anything like Stabilizer in Julia would be the cherry on top, but it wouldn’t remove the need for interleaving. For example, it wouldn’t solve the noisy-neighbour effect or thermal throttling.
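To make the interleaving idea concrete, here is a rough sketch of what a comparison helper could do. This is not an existing BenchmarkTools macro, the function name is made up, and it times single evaluations with @elapsed, so it only makes sense for workloads well above the timer overhead.

using Random, Statistics

# Hypothetical helper: randomly interleave samples of two workloads so that
# slow drifts in machine state (thermal throttling, noisy neighbours, a layout
# change after recompilation) hit both sides roughly equally.
function compare_interleaved(f, g; samples = 1_000)
    tf = Float64[]
    tg = Float64[]
    order = shuffle!([fill(false, samples); fill(true, samples)])
    for pick_g in order
        t = @elapsed (pick_g ? g() : f())
        push!(pick_g ? tg : tf, t)
    end
    (median_f = median(tf), median_g = median(tg), ratio = median(tf) / median(tg))
end

# Usage with placeholder workloads:
compare_interleaved(() -> sum(rand(1_000)), () -> sum(rand(10_000)))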

2 Likes

Doesn’t BenchmarkTools already have a judge function for exactly that purpose? You can already compare benchmarks like that. Whether the result is meaningful depends on what you’re benchmarking though, which so far hasn’t been shared in this thread.
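For reference, a minimal sketch of that workflow; the two workloads are just stand-ins, and judge classifies the change in the minima as :improvement, :regression, or :invariant relative to its time tolerance (5% by default, if I recall correctly).

using BenchmarkTools

old_version() = sum(abs2, 1:10_000)        # stand-in for the baseline
new_version() = sum(x -> x * x, 1:10_000)  # stand-in for the candidate change

t_old = @benchmark old_version()
t_new = @benchmark new_version()

# Compare the two minimum-time estimates
judge(minimum(t_new), minimum(t_old))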

3 Likes

To be fair, the judge function only compares results, but here the question seems to be about the conditions in which the benchmarks are run. Of course it is legitimate to wonder just how far we can go to make benchmarks comparable, and how much more code complexity it would incur.

That is why I also invite people to consider package maintainer time in such debates. I was appointed (trapped) as maintainer for BenchmarkTools.jl without asking for it, without having time for it, and without knowing a damn thing about metaprogramming. I’m doing what I can, but let’s face it: if a PR pops up that reaches deep into cache layout or LLVM hacks, I won’t be able to review it.

9 Likes

judge doesn’t interleave the two runs, so it wouldn’t account for changes in machine state affecting the two results differently.

3 Likes

I don’t think anyone is refuting the premise; it’s just that without an MWE (and “minimal” can still be very large), the reason for the variation could, as far as we know, be several other things. There’s a difference between having an open mind and rejecting all other possible explanations, and I think people are demonstrating the former by asking for a common frame of reference.

The video was informative, but it would have been more accessible to cite 8:03-15:52 specifically to make that point; 42 minutes is a tough ask for people with less free time or slower internet, while 8 minutes is reasonable. It’d also be more accessible to directly link the solution proposed in the video and parroted here, Stabilizer, which has text and links to papers to read. It does appear to be more of an “LLVM hack”, not something where most of the work can be done in a particular package in any particular language. It’s not actively maintained, so a call to action could be made around that tool rather than BenchmarkTools; it sounds like it would benefit many more people.

Stabilizer’s randomization also wouldn’t be something you’d want active all the time, even for performance profiling, because the randomization has runtime and garbage-collection overhead (which usually slows things down, but not always). For example, I’d include randomization to gauge whether an edit made a meaningful performance change, but I would exclude it when recording a program’s benchmark without comparison to different programs. That recording would be vulnerable to layout variation, but it’s still more realistic than having a randomizer active.

Just for fun, here’s a version of the random-sleep benchmark with no allocations; it satisfies all the conditions you named except purity.
julia> f2() = Libc.systemsleep(rand(1:10))
f2 (generic function with 1 method)

julia> using BenchmarkTools

julia> @benchmark f2()
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 7.000 s (0.00% GC) to evaluate,
 with a memory estimate of 0 bytes, over 0 allocations.

julia> @benchmark f2()
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.000 s … 9.000 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.500 s            ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.500 s ± 3.536 s  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                     █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4 s           Histogram: frequency by time          9 s <

 Memory estimate: 0 bytes, allocs estimate: 0.

I believe the point was to demonstrate that large performance variations can have other causes people can insert into their own benchmarks (function purity is especially easy to flub), so a common MWE is important.

9 Likes

Ironically, this sentence fails to accept the premise.

I understood the point. My counterpoint is that the histograms I showed, and the one this example produces, are very different. If you can write code that produces two histograms like mine, i.e. clumped yet completely non-overlapping, then you will have made the point for me. Note too that the benchmark tool ran different numbers of evaluations each time. I didn’t change any system settings between runs.

However, let me add another piece of information. The code is single-threaded, and uses no thread pausing operations of any kind.

Even accepting the point that such variations can have other causes (which I do), that doesn’t rule out that sometimes (even often?) the cause is a change of memory layout, and that accounting for it in benchmarks would be a good idea.

You’ll see from the talk (24:30-25:50) that if you take the effect of memory layout into account, then the effect of -O3 optimization over -O2 optimization is indistinguishable from noise. It would be great to have benchmark tools that take this into account, so that we don’t waste time writing and merging improvements that actually aren’t improvements.

This sounds annoying for you. Please don’t take any of my comments here as me suggesting you need to do or review something.

Then excuse me for misunderstanding the premise, but there might be a communication issue on your end. I, along with other readers, can see the reported variation in benchmark distributions in the absence of any code changes, restarts, or garbage collection, and I believe we are all seriously considering your hypothesis that the cause is memory layout. I can’t speak for anyone else, but I have watched the video you linked, found the relevant GitHub repository, and read their publication. Yet even with that effort, your point still eludes me.

It doesn’t help to drop breadcrumbs of information for people trying and failing to discover your point. Please communicate it clearly and provide a MWE that we can use as a common frame of reference. Otherwise this thread, ironically in your words, “lacks actionability.”

4 Likes

Thanks for sharing the timings. I did look into sharing particular segments, but in the end decided that I wouldn’t prejudice what people watch. I think the whole talk is fascinating, and I encourage anyone interested in performance to watch it all. You can set the playback speed; the video is completely understandable at 2x, which brings it down to about 20 minutes.

So you’re saying that after reading the whole of this post, watching a talk by someone else (a Professor of CS, no less), and reading an academic publication written by two people who aren’t me, the problem is my poor communication? OK. I have a different hypothesis.

Sorry it doesn’t help you. Not everything I write can be of help to everyone. It might help some people. Trying to censor something because it didn’t help you isn’t constructive.

Which bit wasn’t clear? If I know, then maybe I can help.

It’s not irony though, is it? I never claimed that the thread was actionable.

Without a MWE, there is no way to confirm that changes to @benchmark resolve the problem I’m raising. A MWE is required to verify a fix. I acknowledge this.

A MWE is not required to understand and acknowledge the problem. I’m not sure what it would do, other than demonstrate an effect on your computer that has already been claimed by a Professor of CS in a public talk. If your scepticism requires that then so be it.

My reluctance to provide a MWE doesn’t stem from an expectation that other people do the work. It stems from a difficulty in making something reproducible. It’s impacted by hardware, OS and whatever is running on your computer at the time.

I would love to provide an MWE. I’m spending some time on it. More time responding to posts here though…

Disagreement is not equivalent to censorship, and only the admins have the power to censor us.

What your benchmark is doing. That’s why multiple people have asked for a MWE and given reasons why you should provide one, while also acknowledging your opinion that it will be difficult.

My apologies, I was confused by your assertions that there is an “issue” with a “fix” and that you would “take a look at the BenchmarkTools code and try to do [a simple hack]”.

I think an MWE actually is required for people to understand and acknowledge the problem in your benchmark; that’s what the others have said too. We’re not doubting that memory layout affects performance, but since it’s actually very hard to think up a benchmark that unambiguously demonstrates the effect the way yours does, we would very much like to see what yours is doing.

Godspeed on the former.

1 Like

I believe you can see the problem: the two histograms are completely different. Is the issue that you think my code might be doing something that explains this effect in some other way than memory layout changing?

Firstly, that is not accepting the premise of the post, which is the very thing I’ve been complaining about. I don’t think it’s constructive. But I don’t want this thread to be about that. Secondly, I’ve tried to explain that the code is not doing anything funky: no threads, no IO, just pure computation. But perhaps you don’t believe me and want to confirm this? Do you just want to see the code? I can share the code; it’s hundreds of lines. People have made quite clear that watching a 40-minute video is too much of an ask. Why would anyone want to trawl through hundreds of lines of code they’ve never seen before?

Or do you actually want to be able to reliably reproduce the problem? That would be great, and definitely something helpful, especially for validating a fix. What if you don’t see the same effect on your computer, though? Will you then say it’s a problem with my computer or environment? Do you see the problem?

I personally don’t think you’ve really justified why an MWE is required for people to understand and acknowledge the problem. I’ll repeat my assertion: the only thing it’s required for is validating a fix. Which is a way off, isn’t it?

But maybe this will help either or both of us. I have some code (that I have given some info about) that does a computation and produced those two benchmark results. Please select all the reasons why you need an MWE:

  • To verify that the code itself isn’t causing the difference
  • To believe that memory layout can cause such a difference
  • To believe that memory layout is causing this difference
  • To work on a fix
  • To validate a proposed fix
  • I don’t need a MWE

Thank you for voting! The more voters the better.

Having followed this way-overheated thread, this is my view: I appreciate the information in the video and the lessons we can take from it. However, I and many others have been doing benchmarks for a long time on different code, and while I’m not claiming that benchmarks don’t fluctuate, I have never experienced something like what you are reporting (I think). Thus an MWE would be good not only for actually seeing the issue in operation, but also for understanding what kind of code pattern can cause it to manifest, and for getting a sense of how unusual, or usual, it is.

All this matters for gauging how important, in general, an improvement in the benchmark tools would be. Note that an academic video and paper can also demonstrate results on the basis of unusual code layouts. The fact that the problem can exist (and possibly does exist in your example) doesn’t mean it is common, and in any case a fix would probably involve understanding what in your example (or an MWE) is causing the issue to manifest so clearly.

15 Likes

Thank you for the helpful response :heart:

I think an interesting result, and one relevant to this comment, is the one I mentioned earlier:

I believe the code they use to evaluate that is all from standard benchmarks, not their own, i.e. very common memory-layout patterns.

1 Like