Again, x86 architecture. I assume that some of you have Apple silicon…
Is your system Apple Silicon? That'd make me even less likely to suspect your performance puzzle is related to memory layout (though of course it's far from ruled out!), since macOS on Apple Silicon has a rather complicated scheduler, quite different from what most x86 chips use (though Intel's more recent CPUs may share some characteristics, since they also have heterogeneous cores). Apple Silicon's mobile lineage shows not only in its big and little cores, but also in spiky overclocking behaviour: the system is constantly reasoning about how many resources to give a given process, turbocharging what it suspects are short-lived, latency-sensitive processes while deprioritizing what it thinks are long-lived, throughput-oriented ones.
macOS is also notorious for spinning up and down highly resource-intensive background processes.
On the contrary, the MWE was very enlightening. People were able to attempt it and verify the machine-dependence, and people could suggest changes. I especially appreciate the boxplot graph variant; it's much clearer than a few @benchmark printouts. The skepticism is just part of fact-finding. Just because people have an open mind about alternative causes and confounding factors doesn't mean they aren't exploring the cause you want to address. In fact, all suggestions so far explore memory layout.
As a moderator here, I'd like to try to de-escalate a bit. Please consider what you'd want from this thread. IMV, the amount of engagement you've gotten here is great. Yes, there's skepticism, but people are engaging. And that's far better than someone tacitly saying "ok, cool story" and moving on.
This could — and should — be a fun "whodunnit" when approached with an open mind, curiosity, and respect for others. Let's please put a pin in trying to police the commentary itself.
Yes, my system has an M1 Pro CPU.
As I understand it, this issue is not related to scheduling but to cache line conflicts, branch prediction conflicts, and the like.
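One way to poke at the layout hypothesis directly is to look at where Julia actually places the buffers in a given session. This is a minimal sketch, not the MWE itself: the arrays `a` and `b` and the 128-byte line size are assumptions (Apple Silicon's L1 cache lines are 128 bytes; most x86 chips use 64).

```julia
# Sketch: inspect each buffer's offset within a cache line.
# Assumption: `a` and `b` stand in for the MWE's actual arrays.
cacheline = 128  # bytes; Apple Silicon L1 line size (use 64 on most x86)

a = rand(Float64, 1000)
b = rand(Float64, 1000)

for (name, arr) in (("a", a), ("b", b))
    addr = UInt(pointer(arr))
    println(name, " offset within a cache line: ", addr % cacheline)
end
```

If two hot buffers consistently land at conflicting offsets in one session and not in another, that would be at least circumstantial evidence for the layout story.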
I think you're focusing on the prior, and not on the data. I have shown several experiments where there is no significant change in run time across multiple benchmarks. And I have shown several where there is. If these other factors were at play, as you suggest, then I would see this behaviour across every single benchmark I run, not just the ones where I claim there is an issue. I also return to another point I made before: how does your proposition make the change in run time happen precisely between runs of @benchmark, and not during?
Well, I'd actually disagree a bit there. I mean, that's what most people here want out of this thread, but it's somewhat explicitly not what @user664303 wanted out of it: they were mostly posting about wanting a better way for BenchmarkTools.jl to account for memory alignment and make benchmarks more comparable.
Most people's attitude seems to be: "yeah, that'd be nice, but it's hard enough to do that it's not really actionable. But anyway, let's dig into what may or may not actually be happening here, since there are good reasons you might have mistaken beliefs about the cause of this."
And on that subject, I did find that someone is maintaining the Stabilizer tool, and posed this question to them.
Why might code reloading cause the issue when regenerating the data doesn't?
Please do take the poll. The last one was good.
I guess I’d just say that as this thread progresses, I’m increasingly less convinced that alignment is the relevant variable here and more convinced that it’s the scheduler, and that’s not because of the prior but because of the data you and others have shown.
Maybe you could help me out by explaining what is making you think this can’t possibly be related to your computer’s rather different scheduler, heuristics, and your OS’s rather different approach to resource intensive background tasks?
Because your CPU scheduler sees those @benchmark runs as separate work units. It can and does make guesses about what to do with an individual function, and weighs that information against all sorts of opaque things like the current thermal conditions, what other programs are scheduled, whether or not Mercury is in retrograde, etc.
Good question. I’d be guessing… Perhaps branch prediction conflicts, perhaps changing the stack address slightly. Who knows.
Though I should reiterate that I agree with your original premise regardless: it'd be good if BenchmarkTools.jl were able to isolate things like alignment, since that would make it much easier to know whether or not it was causing the problem!
Now that we're running 20 calls to @benchmark in a loop, how can the scheduler determine that they're different runs?
One reason is that the timing always changes between code reloads. See the point I make above. Another is that my code is intensive and single threaded. There isn’t much else going on on my machine. It will just get thrown onto a high power core and run at full throttle.
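The within-session-vs-across-reloads experiment can be sketched without BenchmarkTools at all, using raw `time_ns` timings. This is only a hedged stand-in: `work` is a placeholder for the MWE's actual kernel, and the minimum-over-many-evaluations strategy is borrowed from how BenchmarkTools reports results. If the scheduler were the moving part, the 20 minima collected inside one session should drift; if layout fixed at load time dominates, they should cluster tightly and only shift after a reload.

```julia
# Stand-in for calling @benchmark 20 times in a loop, using time_ns
# so it runs without BenchmarkTools. Assumption: `work` is a
# placeholder for the MWE's kernel.
work(x) = sum(abs2, x)

function best_time(f, x; evals = 10_000)
    best = typemax(UInt64)
    for _ in 1:evals
        t0 = time_ns()
        f(x)
        t1 = time_ns()
        best = min(best, t1 - t0)   # keep the minimum, as @benchmark does
    end
    return best
end

x = rand(100)
mins = [best_time(work, x) for _ in 1:20]
println(mins)
```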
It’s funny that we’re still talking about my code/results, though, isn’t it.
Because every non-inlined function call is seen as an opportunity for the scheduler to make new guesses about what to do.
Maybe it'd help if I clarified that I'm not talking about some high-level task scheduler that just sees julia and a usage stat. There are very low-level schedulers used for heterogeneous architectures like Apple Silicon that make decisions based on incredibly granular information, like what bundle of instructions is loaded into the cache.
That is very much not the case generically.
I assume reloading the code requires the setup methods to be recompiled so you have a bunch of new addresses to cache. CPUs cache instructions and data separately. I assume this is why Stabilizer randomizes code, stack, and heap independently.
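The point that code and data are cached (and addressed) separately can be seen from within Julia. This is an illustration, not Stabilizer itself: `h` is a made-up function, and the only claim is that JIT-compiled code and heap allocations live in distinct address regions, which is consistent with Stabilizer randomizing code, stack, and heap independently.

```julia
# Illustration: compiled code and heap data occupy separate address
# regions. Assumption: `h` is a throwaway function defined for this demo.
h() = 0
code_addr = UInt(@cfunction(h, Int, ()))      # address of compiled code
heap_addr = UInt(pointer(rand(Float64, 8)))   # address of a heap buffer

println("code: 0x", string(code_addr, base = 16))
println("heap: 0x", string(heap_addr, base = 16))
```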
This poked at my memory a bit, and the Stabilizer technical paper does mention another study where the environment affected job scheduling. Not sure if that's related, but after seeing a surname affect caching, I won't be surprised if memory layout does somehow affect scheduling. Maybe I should do some jumping jacks.
That is not my understanding. Multitasking OSes run so many processes that they will interrupt even a single-threaded program and resume it on a different core.
My machine has 9 cores. Here is the usage when I’m running the code:
There is basically nothing else going on. Just for you, I’ve started Julia with:
sudo nice -n -20 julia
This sets the Julia process to the highest priority possible. Given these two things, the idea that something would interrupt the Julia process seems remote. But most of the things being suggested here seem remote to me.
Here’s the result of the code that I say is affected by memory layout:
Now here’s the result of the code that just does a small matrix multiply a load of times:
One more hypothesis: the MWE is using StaticArrays, which does a number of hairy things that are not 100% kosher. There's a chance of some nasal demons flying around. I've seen some funny performance things with @pure in the past, for example.