Again, x86 architecture. I assume that some of you have Apple silicon…
Is your system Apple Silicon? That'd make me even less likely to suspect your performance puzzle is related to memory layout (though of course it's far from ruled out!), since macOS on Apple Silicon has a rather complicated scheduler, quite different from what most x86 chips use (though Intel's more recent CPUs may share some characteristics, since they also have heterogeneous cores). Apple Silicon's mobile lineage shows not only in its big and little cores, but also in spiky overclocking behaviour: the system is constantly reasoning about how many resources to give a given process, turbocharging what it suspects are short-lived, latency-sensitive processes while deprioritizing what it thinks are long-lived, throughput-oriented ones.
macOS is also notorious for spinning up and down highly resource-intensive background processes.
On the contrary, the MWE was very enlightening. People were able to attempt it and verify the machine-dependence, and people could suggest changes. I especially appreciate the boxplot graph variant; it's much clearer than a few @benchmark printouts. The skepticism is just part of fact-finding. Just because people have an open mind about alternative causes and confounding factors doesn't mean they aren't exploring the cause you want to address. In fact, all suggestions so far explore memory layout.
As a moderator here, I'd like to try to de-escalate a bit. Please consider what you'd want from this thread. IMV, the amount of engagement you've gotten here is great. Yes, there's skepticism, but people are engaging. And that's far better than someone tacitly saying "ok, cool story" and moving on.
This could — and should — be a fun "whodunnit" when approached with an open mind, curiosity, and respect for others. Let's please put a pin in trying to police the commentary itself.
Yes, my system has an M1 Pro CPU.
As I understand it, this issue is not related to scheduling but to cache line conflicts, branch prediction conflicts, and the like.
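One way to poke at the layout hypothesis directly is to look at where Julia actually places the buffers in a given session. This is a minimal sketch, not the MWE itself: the arrays `a` and `b` and the 128-byte line size are assumptions (Apple Silicon's L1 cache lines are 128 bytes; most x86 chips use 64).

```julia
# Sketch: inspect each buffer's offset within a cache line.
# Assumption: `a` and `b` stand in for the MWE's actual arrays.
cacheline = 128  # bytes; Apple Silicon L1 line size (use 64 on most x86)

a = rand(Float64, 1000)
b = rand(Float64, 1000)

for (name, arr) in (("a", a), ("b", b))
    addr = UInt(pointer(arr))
    println(name, " offset within a cache line: ", addr % cacheline)
end
```

If two hot buffers consistently land at conflicting offsets in one session and not in another, that would be at least circumstantial evidence for the layout story.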
I think you're focusing on the prior, and not on the data. I have shown several experiments where there is no significant change in run time across multiple benchmarks. And I have shown several where there is. If these other factors were at play, as you suggest, then I would see this behaviour across every single benchmark I run, not just the ones where I claim there is an issue. I also return to another point I made before: how does your proposition make the change in run time happen precisely between runs of @benchmark, and not during?
Well, I'd actually disagree a bit there. I mean, that's what most people here want out of this thread, but it's somewhat explicitly not what @user664303 wanted out of it: they were mostly posting about wanting a better way for BenchmarkTools.jl to account for memory alignment and make benchmarks more comparable.
Most people's attitude seems to be: "yeah, that'd be nice, but it's hard enough to do that it's not really actionable. But anyway, let's dig into what may or may not actually be happening here, since there are good reasons you might have mistaken beliefs about the cause of this."
And on that subject, I did find that someone is maintaining the Stabilizer tool, and posed this question to them.
Why might code reloading cause the issue when regenerating the data doesn't?
Please do take the poll. The last one was good.
I guess I’d just say that as this thread progresses, I’m increasingly less convinced that alignment is the relevant variable here and more convinced that it’s the scheduler, and that’s not because of the prior but because of the data you and others have shown.
Maybe you could help me out by explaining what is making you think this can’t possibly be related to your computer’s rather different scheduler, heuristics, and your OS’s rather different approach to resource intensive background tasks?
Because your CPU scheduler sees those @benchmark runs as separate work units. It can and does make guesses about what to do with an individual function, and weighs that information against all sorts of opaque things like the current thermal conditions, what other programs are scheduled, whether or not Mercury is in retrograde, etc.
Good question. I’d be guessing… Perhaps branch prediction conflicts, perhaps changing the stack address slightly. Who knows.
Though I should reiterate that I agree with your original premise regardless: it'd be good if BenchmarkTools.jl were able to isolate things like alignment, since that would make it much easier to know whether or not it was causing the problem!
Now that we're running 20 calls to @benchmark in a loop, how can the scheduler determine that they're different runs?
One reason is that the timing always changes between code reloads. See the point I make above. Another is that my code is intensive and single threaded. There isn’t much else going on on my machine. It will just get thrown onto a high power core and run at full throttle.
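The within-session-vs-across-reloads experiment can be sketched without BenchmarkTools at all, using raw `time_ns` timings. This is only a hedged stand-in: `work` is a placeholder for the MWE's actual kernel, and the minimum-over-many-evaluations strategy is borrowed from how BenchmarkTools reports results. If the scheduler were the moving part, the 20 minima collected inside one session should drift; if layout fixed at load time dominates, they should cluster tightly and only shift after a reload.

```julia
# Stand-in for calling @benchmark 20 times in a loop, using time_ns
# so it runs without BenchmarkTools. Assumption: `work` is a
# placeholder for the MWE's kernel.
work(x) = sum(abs2, x)

function best_time(f, x; evals = 10_000)
    best = typemax(UInt64)
    for _ in 1:evals
        t0 = time_ns()
        f(x)
        t1 = time_ns()
        best = min(best, t1 - t0)   # keep the minimum, as @benchmark does
    end
    return best
end

x = rand(100)
mins = [best_time(work, x) for _ in 1:20]
println(mins)
```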
It’s funny that we’re still talking about my code/results, though, isn’t it.
Because every non-inlined function call is seen as an opportunity for the scheduler to make new guesses about what to do.
Maybe it'd help if I clarified that I'm not talking about some high-level task scheduler that just sees julia and a usage stat. There are very low-level schedulers used for heterogeneous architectures like Apple Silicon that make decisions based on incredibly granular information, like what bundle of instructions is loaded into the cache.
That is very much not the case generically.
I assume reloading the code requires the setup methods to be recompiled so you have a bunch of new addresses to cache. CPUs cache instructions and data separately. I assume this is why Stabilizer randomizes code, stack, and heap independently.
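The point that code and data are cached (and addressed) separately can be seen from within Julia. This is an illustration, not Stabilizer itself: `h` is a made-up function, and the only claim is that JIT-compiled code and heap allocations live in distinct address regions, which is consistent with Stabilizer randomizing code, stack, and heap independently.

```julia
# Illustration: compiled code and heap data occupy separate address
# regions. Assumption: `h` is a throwaway function defined for this demo.
h() = 0
code_addr = UInt(@cfunction(h, Int, ()))      # address of compiled code
heap_addr = UInt(pointer(rand(Float64, 8)))   # address of a heap buffer

println("code: 0x", string(code_addr, base = 16))
println("heap: 0x", string(heap_addr, base = 16))
```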
This poked at my memory a bit, and the Stabilizer technical paper does mention another study where the environment affected job scheduling. Not sure if that's related, but after seeing a surname affect caching, I won't be surprised if memory layout does somehow affect scheduling. Maybe I should do some jumping jacks.
That is not my understanding. Multitasking OSes run so many processes that they will interrupt even a single-threaded program and resume it on a different core.
My machine has 9 cores. Here is the usage when I’m running the code:
There is basically nothing else going on. Just for you, I’ve started Julia with:
sudo nice -n -20 julia
This sets the Julia process to the highest priority possible. Given these two things, the idea that something would interrupt the Julia process seems remote. But most of the things being suggested here seem remote to me.
Here’s the result of the code that I say is affected by memory layout:
Now here’s the result of the code that just does a small matrix multiply a load of times:
One more hypothesis: the MWE is using StaticArrays, which does a number of hairy things that are not 100% kosher. There's a chance of some nasal demons flying around. I've seen some funny performance things with @pure in the past, for example.