Do you have a reference for nasal demons? Not sure what they are.
My simple baseline method (without an issue) also uses StaticArrays.
Do you think we need to account for memory layout issues in benchmark tooling?
I'm guessing that's a humorous typo of "nasty".
Again though, I wasn't talking about the high-level process scheduler (i.e. what nice would affect) but the low-level CPU scheduler that is making things like clock-speed choices, latency optimization, and big/little core choices. I don't think what we're seeing here is related to interrupts or task switching (that wouldn't explain why the outlier is a fast run, not a slow run).
Maybe I'm missing something, but wouldn't we expect this to also be affected by similar memory layout effects if that was the culprit?
Sorry, "nasal demons" is an overly cutesy reference to undefined behavior.
Personally, I think it would be really cool to account for it. But I also think it's a very hard problem that would require significant effort from someone who is knee-deep in the Julia compiler and LLVM, someone who has plenty of other things to do (things I'd even more like to see done). And I'm still not 100% convinced that this is a bigger problem than any of the other sources of variability mentioned here.
Sounds great, and I voted yes. (I'd also vote yes for free ponies, though.) Unfortunately, Discourse polls often aren't the best way to get things like this built (or to get free ponies distributed). But I'll continue to vote just in case.
I've posted at least 4 images of the issue. In one the outlier was slow, and in another it was a bit of both.
There wasn't an option for nuance/other. Stabilizer doesn't sound like something you'd want to tack onto a benchmarking library directly, because it transforms the LLVM compiler to add the randomization instructions. That means we'd compile a typical version or a randomized version with some tweaked signature (the latter probably done by some macro call), but there's not a whole lot of point in keeping the second one around once benchmarking is done. It seems simpler to do the randomized benchmarking in a separate process where everything is randomized by a compiler option; that way it won't just affect benchmark tools. The setup option of BenchmarkTools already lets you rerun code between samples, and that should have the effect of shuffling some code and heap addresses (so far we've only seen a benchmark for the latter). But anything beyond that should be an orthogonal compiler option, IMO.
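For anyone who hasn't used it, this is roughly what the setup keyword looks like; the workload here is just an illustrative stand-in:

```julia
using BenchmarkTools

# `setup` runs once per sample; with `evals=1` each sample is a single
# evaluation, so every timing sees a freshly allocated vector and hence
# a (potentially) different heap address.
b = @benchmark sum(v) setup=(v = rand(1000)) evals=1
```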
Apparently this just controls user-space priority, and there are higher priorities for realtime processes. No idea why there isn't just 1 scale.
I wasn't implying any specific solution. I really just wanted to know how many people, having read the thread, had been convinced that this was a problem with benchmarking and would like a fix, assuming that were even possible.
The reason I asked is that there's been a lot of pushback on the idea that my results were even caused by memory layout, so I wanted to know what proportion of people still believe it's not an issue at all.
Perhaps some people are in the same situation as I am. Here is my set of beliefs at the moment, after having skimmed the conversation:
Do we need to account for the CPU heating up (and some hardware not having proper cooling for sustained workloads) in BenchmarkTools.jl?
Btw, do you know this is not just the cores heating up?
I think it's clear it isn't, given the patterns of the various results (given earlier).
https://julialinearalgebra.github.io/BLASBenchmarksCPU.jl/dev/turbo/
Talks a bit about disabling CPU frequency scaling on Linux.
You seem to have a prior belief that memory layout is a major contributor. I personally believe that you will convince more people if you disable all frequency scaling, pin all cores to max frequency, and let your CPU come to a constant temperature with the fan running at full throttle for a couple of minutes.
I've seen some very bizarre stuff associated with all those issues in latency benchmarks on packet processing in other forums (OpenWrt). Basically, until the thermal and frequency scaling situation is forced into a steady state before starting the benchmark, it's just really hard to account for all that stuff.
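As a rough sketch of what "force a steady state first" could look like in practice (the function name and the 120-second duration are my own arbitrary choices, not anything from BenchmarkTools):

```julia
# Hypothetical warm-up loop: hammer the kernel under test until the CPU
# has settled into a steady clock speed and temperature, then start the
# real measurements afterwards.
function warmup(f, args...; seconds = 120)
    t0 = time()
    while time() - t0 < seconds
        f(args...)
    end
    return nothing
end

warmup(sum, rand(1000))   # then run the actual benchmark
```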
Also, I'd want to see threads pinned to cores; doable on Linux with ThreadPinning.jl, but I'm not sure if it works on macOS.
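If I remember the ThreadPinning.jl API correctly, pinning looks roughly like this on Linux (`:cores` is one of several pinning strategies):

```julia
using ThreadPinning

pinthreads(:cores)   # pin each Julia thread to its own physical core
threadinfo()         # print the resulting thread-to-core mapping
```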
Definitely concur with these opinions. I watched the video, reviewed the benchmarks here, and have followed this conversation. It seems completely plausible that memory layout is a contributing factor, but I'm not seeing any unambiguous evidence that it's predominantly responsible for the cited anomaly.
It's an interesting premise and having a benchmarking tool that could account for this effect would be great. The only remaining question then is who will be doing the work to build it? If there's no way to achieve this effect without dipping into the compiler or LLVM behavior then this is likely to drive a lot of long-term maintenance burden, à la what the developers of Cthulhu.jl go through. Maybe there's some other way to achieve this in pure Julia? I'm sure the BenchmarkTools maintainer would be fully willing to entertain a Pull Request contributing a working version of this code, @user664303. It sounds like you've got some understanding of the underlying issues, so why not take a stab at it? Open Source thrives when a community works together to solve problems.
I've got some understanding of the underlying causes, yes. I don't have an understanding of how to randomize the code and data memory layout, currently. I've emailed the current maintainer of the Stabilizer tool. I'll see if that gets me anywhere. I make no promises.
A positive outcome of this post would have been to make people aware of the issue. But it doesn't really sound like I've achieved that.
On these points, these are my current beliefs:
The following is a complete list of command-line switches available when launching julia (a "*" marks the default value, if applicable; settings marked "($)" may trigger package precompilation):
…
-O, --optimize={0,1,2*,3}
    Set the optimization level (level is 3 if -O is used without a level) ($)
…
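For what it's worth, you can check which optimization level the running session was started with; this goes through the internal `Base.JLOptions` struct, so treat it as an implementation detail rather than a stable API:

```julia
# Query the optimization level of the current julia process,
# e.g. 3 after starting with `julia -O3`.
Base.JLOptions().opt_level
```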
I think it's fair to conclude the opposite: that your post has definitely driven significant awareness that memory alignment can affect performance. There are 100+ posts in this thread, and a number of people here, myself included, have admitted to watching the video. In terms of bringing attention to a subject, that's a home run. My understanding is that, despite a number of long-term contributors listed, BenchmarkTools has very few active maintainers, and probably none with this level of bandwidth available.
There's still some contention over whether the provided MWE offers conclusive evidence for the thesis that memory alignment alone is responsible for significant run-to-run variance, but that's just part of the scientific process. Think of it as a badge of honor that so many have taken sufficient interest to even run your MWE. Unexpected results get people off the sidelines. Regardless of whether there is consensus, the crux of this issue is what comes next. Somebody needs to actually implement it.
It's possible that the authors of Stabilizer will be interested, but I wouldn't count on it, at least directly. Neither has any Julia projects in their GitHub accounts, so just getting them up to speed and involved in developing for Julia would be its own hurdle. They're also academics who published on this topic years ago and haven't actively maintained it in the better part of a decade; circling back to well-trod ground often isn't particularly interesting to this demographic. Your best bet there might be to suggest it as a potential project for one of their undergrad/grad students to work on.
I'd certainly be rooting for you, or anyone else interested in bringing this vision to life.
This was not my point.