Making @benchmark outputs statistically meaningful and actionable

Do you have a reference for nasal demons? Not sure what they are.

My simple baseline method (without an issue) also uses StaticArrays.

Do you think we need to account for memory layout issues in benchmark tooling?

  • Yes!
  • No!
0 voters

I'm guessing that's a humorous typo of "nasty".

Again though, I wasn't talking about the high-level process scheduler (i.e. what nice would affect) but the low-level CPU scheduler that is making things like clock-speed choices, latency optimizations, and big/little core choices. I don't think what we're seeing here is related to interrupts or task switching (that wouldn't explain why the outlier is a fast run, not a slow run).

Maybe I'm missing something, but wouldn't we expect this to also be affected by similar memory layout effects if that were the culprit?

1 Like

Sorry, "nasal demons" is an overly cutesy reference to undefined behavior.

Personally, I think it would be really cool to account for it. But I also think it's a very hard problem that'd require significant effort from someone who is knee-deep in the Julia compiler and LLVM. Someone who has plenty of other things to do (things that I'd even more like to see done). And I'm still not 100% convinced that this is a bigger problem than any of the other sources of variability mentioned here.

2 Likes

Sounds great, and I voted yes. (I'd also vote yes for free ponies, though.) Unfortunately, Discourse polls often aren't the best way to get things like this built (or to get free ponies distributed). But I'll continue to vote just in case :wink:

2 Likes

I've posted at least 4 images of the issue. In one the outlier was slow, and in another it was a bit of both.

There wasn't an option for nuance/other. Stabilizer doesn't sound like something you'd want to tack onto a benchmarking library directly, because it transforms the LLVM compiler to add the randomization instructions. That means we'd compile either a typical version or a randomized version with some tweaked signature (the latter probably produced by some macro call), but there's not a whole lot of point in keeping the second one around once benchmarking is done. It seems simpler to do the randomized benchmarking in a separate process where everything is randomized by a compiler option; that way it isn't tied to benchmark tools. The setup option of BenchmarkTools already lets you rerun code between samples, which should have the effect of shuffling some code and heap addresses (so far we've only seen a benchmark for the latter). But anything beyond that should be an orthogonal compiler option, IMO.
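For reference, a minimal sketch of that setup idea (the summed array and its size are just illustrative):

```julia
using BenchmarkTools

# Allocating `x` inside `setup` gives every sample a freshly
# allocated array, so its heap address can vary between samples.
# `evals=1` ensures the setup reruns before each evaluation.
b = @benchmark sum(x) setup=(x = rand(1_000)) evals=1
display(b)
```

This only reshuffles the data's heap placement, not the compiled code's layout, which is consistent with the point above that anything more would need compiler support.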

Apparently this just controls user-space priority, and there are higher priorities for realtime processes. No idea why there isn't just one scale.

1 Like

I wasn't implying any specific solution. I really just wanted to know how many people, having read the thread, had been convinced that this was a problem with benchmarking and would like a fix, assuming that were even possible.

The reason I asked is that there's been a lot of pushback on the idea that my results were even caused by memory layout, so I wanted to know what proportion of people still believe it's not an issue at all.

Perhaps some people are in the same situation as I am. Here is my set of beliefs at the moment, after having skimmed the conversation:

  • Memory layout might play a role in benchmarking variability
  • So can lots of other things that we won't control anyway
  • Changing that would require a lot of effort for unclear benefits

10 Likes

Do we need to account in BenchmarkTools.jl for the CPU heating up, given that some hardware doesn't have proper cooling for sustained workloads?

Btw, do you know this is not just the cores heating up?

I think it's clear it isn't, given the patterns of the various results (given earlier).

https://julialinearalgebra.github.io/BLASBenchmarksCPU.jl/dev/turbo/
This talks a bit about disabling CPU frequency scaling on Linux.

3 Likes

You seem to have a prior belief that memory layout is a major contributor. I personally believe you will convince more people if you disable all frequency scaling, pin all cores to max frequency, and let your CPU come to a constant temperature with the fan running at full throttle for a couple of minutes.

I've seen some very bizarre stuff associated with all of those issues in latency benchmarks of packet processing on other forums (OpenWrt). Basically, until the thermal and frequency-scaling situation is forced into a steady state before starting the benchmark, it's just really hard to account for all that stuff.

Also, I'd want to see threads pinned to cores; that's doable on Linux with ThreadPinning.jl, but I'm not sure whether it works on macOS.
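For what it's worth, pinning with ThreadPinning.jl is a one-liner, assuming a multi-threaded Julia session (e.g. started with `julia -t 4`) on Linux:

```julia
using ThreadPinning  # Linux-only; a no-op on unsupported OSes

pinthreads(:cores)   # pin each Julia thread to a distinct physical core
threadinfo()         # print which thread landed on which core
```

That removes the OS's freedom to migrate threads between (possibly big/little) cores mid-benchmark, which is one of the confounders discussed above.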

3 Likes

Definitely concur with these opinions. I watched the video, reviewed the benchmarks here, and have followed this conversation. It seems completely plausible that memory layout is a contributing factor, but I'm not seeing any unambiguous evidence that it's predominantly responsible for the cited anomaly.

It's an interesting premise, and having a benchmarking tool that could account for this effect would be great. The only remaining question then is who will do the work to build it. If there's no way to achieve this effect without dipping into the compiler or LLVM behavior, then it is likely to carry a lot of long-term maintenance burden, a la what the developers of Cthulhu.jl go through. Maybe there's some other way to achieve this in pure Julia? I'm sure the BenchmarkTools maintainer would be fully willing to entertain a pull request contributing a working version of this code, @user664303. It sounds like you've got some understanding of the underlying issues, so why not take a stab at it? Open source thrives when a community works together to solve problems.

2 Likes

I've got some understanding of the underlying causes, yes. I don't currently have an understanding of how to randomize the code and data memory layout. I've emailed the current maintainer of the Stabilizer tool; I'll see if that gets me anywhere. I make no promises.

A positive outcome of this post would have been to make people aware of the issue. But it doesn't really sound like I've achieved that.

On these points, these are my current beliefs:

  • Memory layout does play a role in benchmarking variability. This has been demonstrated and explained in several academic papers and talks. Blog post with a couple of references here.
  • Clock speed and CPU contention can also affect benchmarks. Accounting for these when comparing two implementations for speed is simple and straightforward: interleave multiple runs of the two codes.
  • Accounting for memory layout is not straightforward and would require a lot of effort to implement. However, the benefit is clear: engineers would stop over-fitting their implementations to the memory layout they happened to have while benchmarking. E.g. research has shown that the extra optimizations the LLVM compiler makes going from -O2 to -O3 are statistically insignificant. We could avoid that kind of over-fitting.
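To illustrate the interleaving point, here is a rough sketch; the two implementations `f` and `g` and the trial counts are placeholders:

```julia
using BenchmarkTools, Statistics

f(x) = sum(abs2, x)            # placeholder implementation A
g(x) = mapreduce(abs2, +, x)   # placeholder implementation B

x = rand(1_000)
tf, tg = Float64[], Float64[]
# Alternate A/B trials so slow drift (clock speed, temperature,
# background load) affects both implementations roughly equally.
for _ in 1:10
    push!(tf, @belapsed f($x))
    push!(tg, @belapsed g($x))
end
println("f: ", median(tf), " s   g: ", median(tg), " s")
```

Comparing medians (or minima) of interleaved trials cancels drift that back-to-back `@benchmark` runs would not.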

We do:

The following is a complete list of command-line switches available when launching julia (a '*' marks the default value, if applicable; settings marked '($)' may trigger package precompilation):

…
-O, --optimize={0,1,2*,3} Set the optimization level (level is 3 if -O is used without a level) ($)
…

I think it's fair to conclude the opposite: your post has definitely driven significant awareness that memory alignment can affect performance. There are 100+ posts in this thread, and a number of people here, myself included, have admitted to watching the video. In terms of bringing attention to a subject, that's a home run. My understanding is that, despite the number of long-term contributors listed, BenchmarkTools has very few active maintainers, and probably none with this level of bandwidth available.

There's still some contention over whether the provided MWE offers conclusive evidence for the thesis that memory alignment alone is responsible for significant run-to-run variance, but that's just part of the scientific process. Think of it as a badge of honor that so many have taken sufficient interest to even run your MWE. Unexpected results get people off the sidelines. Regardless of whether there is consensus, the crux of this issue is what comes next: somebody needs to actually implement it.

It's possible that the authors of Stabilizer will be interested, but I wouldn't count on it, at least not directly. Neither has any Julia projects in their GitHub accounts, so just getting them up to speed and involved in developing for Julia would be its own hurdle. They're also academics who published on this topic years ago and haven't actively maintained it in the better part of a decade; circling back to well-trod ground often isn't particularly interesting to this demographic. Your best bet there might be to suggest it as a potential project for one of their undergrad/grad students.

I'd certainly be rooting for you, or anyone else, interested in bringing this vision to life.

5 Likes

This was not my point.