Making @benchmark outputs statistically meaningful and actionable

Do you have a reference for nasal demons? Not sure what they are.

My simple baseline method (without an issue) also uses StaticArrays.

Do you think we need to account for memory layout issues in benchmark tooling?

  • Yes!
  • No!
0 voters

I'm guessing that's a humorous typo of "nasty".

Again though, I wasn't talking about the high-level process scheduler (i.e. what nice would affect) but the low-level CPU scheduler that is making things like clock-speed choices, latency optimizations, and big/little core choices. I don't think what we're seeing here is related to interrupts or task switching (that wouldn't explain why the outlier is a fast run, not a slow run).

Maybe I'm missing something, but wouldn't we expect this to also be affected by similar memory layout effects if that were the culprit?

1 Like

Sorry, "nasal demons" is an overly cutesy reference to undefined behavior.

Personally, I think it would be really cool to account for it. But I also think it's a very hard problem that'd require significant effort from someone who is knee-deep in the Julia compiler and LLVM. Someone who has plenty of other things to do (things that I'd even more like to see done). And I'm still not 100% convinced that this is a bigger problem than any of the other sources of variability mentioned here.

2 Likes

Sounds great, and I voted yes. (I'd also vote yes for free ponies, though.) Unfortunately, Discourse polls often aren't the best way to get things like this built (or to get free ponies distributed). But I'll continue to vote just in case :wink:

2 Likes

I've posted at least 4 images of the issue. In one the outlier was slow, and in another it was a bit of both.

There wasn't an option for nuance/other. Stabilizer doesn't sound like something you'd want to tack onto a benchmarking library directly, because it transforms the LLVM compiler to add the randomization instructions. That means we'd compile either a typical version or a randomized version with some tweaked signature (the latter probably produced by some macro call), but there's not a whole lot of point in keeping the second one around once benchmarking is done. It seems simpler to do the randomized benchmarking in a separate process where everything is randomized by a compiler option; that way it isn't tied to benchmark tools. The setup option of BenchmarkTools already lets you rerun code between samples, which should have the effect of shuffling some code and heap addresses (so far we've only seen a benchmark for the latter). But anything beyond that should be an orthogonal compiler option, IMO.
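For reference, a minimal sketch of that setup idea (the summed array and its size are just illustrative):

```julia
using BenchmarkTools

# Allocating `x` inside `setup` gives every sample a freshly
# allocated array, so its heap address can vary between samples.
# `evals=1` ensures the setup reruns before each evaluation.
b = @benchmark sum(x) setup=(x = rand(1_000)) evals=1
display(b)
```

This only reshuffles the data's heap placement, not the compiled code's layout, which is consistent with the point above that anything more would need compiler support.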

Apparently this just controls user-space priority, and there are higher priorities for realtime processes. No idea why there isn't just one scale.

1 Like

I wasn't implying any specific solution. I really just wanted to know how many people, having read the thread, had been convinced that this was a problem with benchmarking and would like a fix, assuming that were even possible.

The reason I asked is that there's been a lot of pushback on the idea that my results were even caused by memory layout, so I wanted to know what proportion of people still believe it's not an issue at all.

Perhaps some people are in the same situation as I am. Here is my set of beliefs at the moment, after having skimmed the conversation:

  • Memory layout might play a role in benchmarking variability
  • So can lots of other things that we won't control anyway
  • Changing that would require a lot of effort for unclear benefits

10 Likes

Do we need to account in BenchmarkTools.jl for the CPU heating up, given that some hardware doesn't have proper cooling for sustained workloads?

Btw, do you know this is not just the cores heating up?

I think it's clear it isn't, given the patterns of the various results (given earlier).

https://julialinearalgebra.github.io/BLASBenchmarksCPU.jl/dev/turbo/
This talks a bit about disabling CPU frequency scaling on Linux.

3 Likes

You seem to have a prior belief that memory layout is a major contributor. I personally believe you will convince more people if you disable all frequency scaling, pin all cores to max frequency, and let your CPU come to a constant temperature with the fan running at full throttle for a couple of minutes.

I've seen some very bizarre stuff associated with all of those issues in latency benchmarks of packet processing on other forums (OpenWrt). Basically, until the thermal and frequency-scaling situation is forced into a steady state before starting the benchmark, it's just really hard to account for all that stuff.

Also, I'd want to see threads pinned to cores; that's doable on Linux with ThreadPinning.jl, but I'm not sure whether it works on macOS.
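For what it's worth, pinning with ThreadPinning.jl is a one-liner, assuming a multi-threaded Julia session (e.g. started with `julia -t 4`) on Linux:

```julia
using ThreadPinning  # Linux-only; a no-op on unsupported OSes

pinthreads(:cores)   # pin each Julia thread to a distinct physical core
threadinfo()         # print which thread landed on which core
```

That removes the OS's freedom to migrate threads between (possibly big/little) cores mid-benchmark, which is one of the confounders discussed above.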

3 Likes

Definitely concur with these opinions. I watched the video, reviewed the benchmarks here, and have followed this conversation. It seems completely plausible that memory layout is a contributing factor, but I'm not seeing any unambiguous evidence that it's predominantly responsible for the cited anomaly.

It's an interesting premise, and having a benchmarking tool that could account for this effect would be great. The only remaining question then is who will do the work to build it. If there's no way to achieve this effect without dipping into the compiler or LLVM behavior, then it is likely to carry a lot of long-term maintenance burden, a la what the developers of Cthulhu.jl go through. Maybe there's some other way to achieve this in pure Julia? I'm sure the BenchmarkTools maintainer would be fully willing to entertain a pull request contributing a working version of this code, @user664303. It sounds like you've got some understanding of the underlying issues, so why not take a stab at it? Open source thrives when a community works together to solve problems.

2 Likes

I've got some understanding of the underlying causes, yes. I don't currently have an understanding of how to randomize the code and data memory layout. I've emailed the current maintainer of the Stabilizer tool; I'll see if that gets me anywhere. I make no promises.

A positive outcome of this post would have been to make people aware of the issue. But it doesn't really sound like I've achieved that.

On these points, these are my current beliefs:

  • Memory layout does play a role in benchmarking variability. This has been demonstrated and explained in several academic papers and talks. Blog post with a couple of references here.
  • Clock speed and CPU contention can also affect benchmarks. Accounting for these when comparing two implementations for speed is simple and straightforward: interleave multiple runs of the two codes.
  • Accounting for memory layout is not straightforward and would require a lot of effort to implement. However, the benefit is clear: engineers would stop over-fitting their implementations to the memory layout they happened to have while benchmarking. E.g. research has shown that the extra optimizations the LLVM compiler makes going from -O2 to -O3 are statistically insignificant. We could avoid that kind of over-fitting.
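To illustrate the interleaving point, here is a rough sketch; the two implementations `f` and `g` and the trial counts are placeholders:

```julia
using BenchmarkTools, Statistics

f(x) = sum(abs2, x)            # placeholder implementation A
g(x) = mapreduce(abs2, +, x)   # placeholder implementation B

x = rand(1_000)
tf, tg = Float64[], Float64[]
# Alternate A/B trials so slow drift (clock speed, temperature,
# background load) affects both implementations roughly equally.
for _ in 1:10
    push!(tf, @belapsed f($x))
    push!(tg, @belapsed g($x))
end
println("f: ", median(tf), " s   g: ", median(tg), " s")
```

Comparing medians (or minima) of interleaved trials cancels drift that back-to-back `@benchmark` runs would not.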

We do:

The following is a complete list of command-line switches available when launching julia (a '*' marks the default value, if applicable; settings marked '($)' may trigger package precompilation):

…
-O, --optimize={0,1,2*,3} Set the optimization level (level is 3 if -O is used without a level) ($)
…

I think it's fair to conclude the opposite: your post has definitely driven significant awareness that memory alignment can affect performance. There are 100+ posts in this thread, and a number of people here, myself included, have admitted to watching the video. In terms of bringing attention to a subject, that's a home run. My understanding is that, despite the number of long-term contributors listed, BenchmarkTools has very few active maintainers, and probably none with this level of bandwidth available.

There's still some contention over whether the provided MWE offers conclusive evidence for the thesis that memory alignment alone is responsible for significant run-to-run variance, but that's just part of the scientific process. Think of it as a badge of honor that so many have taken sufficient interest to even run your MWE. Unexpected results get people off the sidelines. Regardless of whether there is consensus, the crux of this issue is what comes next: somebody needs to actually implement it.

It's possible that the authors of Stabilizer will be interested, but I wouldn't count on it, at least not directly. Neither has any Julia projects in their GitHub accounts, so just getting them up to speed and involved in developing for Julia would be its own hurdle. They're also academics who published on this topic years ago and haven't actively maintained it in the better part of a decade; circling back to well-trod ground often isn't particularly interesting to this demographic. Your best bet there might be to suggest it as a potential project for one of their undergrad/grad students.

I'd certainly be rooting for you, or anyone else, interested in bringing this vision to life.

5 Likes

This was not my point.