Performance problems with the Windows scheduler and multithreading on mixed-core CPUs

I have a function with a long-running simulation (> 20 min) and used Threads.@spawn to start four of them in parallel. Disappointingly, the runtime more than doubled compared to a single-instance run. After some digging I found out that the Windows scheduler had placed all four compute threads on the four small Zen 5c cores of my laptop while the four big cores were idle…

If I manually set the priority of the Julia process to high in the Task Manager, the threads are migrated over to the big cores and I only have about 10% longer runtimes when four instances are running at the same time. (Note: all tests were done with the power supply connected and the Windows power setting set to maximum performance.)

So the question: is there a way to start Julia with higher priority, especially when starting it from VSCode? Or is there a command-line option for Julia?

This is on a Dell laptop with a Ryzen 7 AI 350 CPU (4 big cores / 4 small cores). I don’t know whether Intel CPUs with mixed performance/efficiency cores show the same effect.

I hope someone has an idea how to solve this permanently without manual intervention via the Task Manager.

1 Like

It’s somewhat surprising to me that the Zen 5c cores were so much slower. Other than clock speed and L2 cache size, they are identical to the big Zen 5 cores.

1 Like

But the max clock is 5 GHz vs 3.3 GHz, and the L3 cache is only half the size. Maybe this already results in a factor of two for my simulations, which work on relatively large arrays.

3 Likes

Yeah, that’s very possible. I’d be really interested to see whether the Linux scheduler does a better job, but I don’t think there’s an easy way to launch Julia with higher priority.

You can use ThreadPinning.jl (GitHub - carstenbauer/ThreadPinning.jl) to readily pin Julia threads to CPU-threads on Linux, but not on Windows. I am not sure whether this is a limitation of the OS or of the package.

Work-arounds:

  • Manual Affinity with Task Manager: After launching Julia, go to Windows Task Manager, locate julia.exe, right-click, choose “Set affinity”, and manually select the performance cores (consult your CPU documentation for which logical CPUs match P-cores).
  • The start command in Windows can be used with an affinity mask, but you must know the mapping of logical processors to physical P-cores and E-cores for your CPU.

Run on cores 0, 1, 2, 3 (assuming the first four logical CPUs are the P-cores):

start /affinity F julia -t 4

Something like this might work, but I did not test it; I do not have Windows.
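If you’d rather compute the mask than hand-assemble it, a small helper can build the hex value for `start /affinity` from a list of logical-CPU indices. This is a minimal sketch; which indices correspond to P-cores is machine-specific, so verify the layout for your CPU first:

```julia
# Sketch: build the hex affinity mask for `start /affinity` from a list
# of logical-CPU indices. Which indices are P-cores is machine-specific;
# check with a tool such as HWiNFO64 before relying on this.
affinity_mask(cpus) = foldl((m, c) -> m | (UInt(1) << c), cpus; init = UInt(0))

mask = affinity_mask(0:3)          # logical CPUs 0-3
println(string(mask, base = 16))   # → "f", as in `start /affinity F julia -t 4`
```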

2 Likes

It seems like an OS bug, or at least something the OS needs to improve. I’m not saying it can know you want the faster cores, but shouldn’t it migrate threads to them when it sees that would help?!

You can enforce that (hypothetically, or on Linux), but is that the way forward? If every program did that for its users, or Julia did it for them, then nobody would use the slower efficiency cores. Is this best left up to the programmer?

I see mobile CPUs already have three levels of cores, and I’ve thought about bringing that up here (or in Offtopic), since it will eventually be relevant for Julia, but I didn’t think it was relevant yet (I’m apparently behind on the tech) for desktop/laptop, not even two levels like big.LITTLE. Maybe Julia needs redesigning and some support for such two- or more-level architectures? Does it, or e.g. OpenBLAS, have any? MKL?

[This is independent of language, so what are other languages doing, e.g. on mobile, which had this first? Is there special support in Android and/or iOS? By now macOS?]

I think this has nothing directly, only indirectly, to do with the big cores. Higher priority just means the threads are scheduled more often, and I’m guessing some performance counters kick in. Could the code have some locking issues? I think if you have CPU-bound code, even on one core, or on many, the OS should figure it out. If the program isn’t constantly running, e.g. because it is doing I/O, that likely prevents the OS from figuring out that it’s CPU-demanding?! Are you running Julia 1.12.0 (which is seemingly already released and will seemingly be announced today), i.e. with the default 1 interactive thread?

I found lots of interesting related papers (not just these, and the MSc thesis):
https://dl.acm.org/doi/pdf/10.1109/SBCCI62366.2024.10703981

irregular microarchitecture challenges the programmer to fully explore the parallelism potential of many parallel applications. In this paper, we propose Mímir, a library for automatically finding, at runtime, the ideal number of threads for each parallel region of OpenMP applications executing on AMPs. It is transparent to the end-user, requiring no changes in the source code or recompilation. Our experiments, considering eleven parallel applications executed on an Intel Alder Lake, report that Mímir can reduce, on average, 74.79%, 72.37%, and 68.89% of the Energy-Delay Product of the applications, respectively, considering all cores, only P-Cores, and only E-Cores.

https://arxiv.org/pdf/1702.04028

It’s a limitation of the package. I don’t have great access to Windows machines and frankly don’t care too much about Windows. It could be supported, though.

1 Like

That would be great. Thank you very much in advance in case you do it!

NB. For the record, I wonder whether ThreadPinning.jl works on WSL2, since CPU assignment is still managed by Windows? Does anyone know something about this?

@Palli was right. After some more observation I realized that the priority has no direct impact. What really causes the scheduler to put the Julia threads on the small cores is whether the Julia window (VSCode in my case) is the foreground window or in the background. If another interactive application, like the browser, is in the foreground, the compute threads are migrated to the small cores. If I bring the VSCode window to the foreground again, after a few seconds they are all back on the big cores. So CPU pinning would be the way to go, if it someday becomes possible…
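Until a package supports pinning on Windows, a hypothetical sketch of setting the process affinity via the Win32 API directly might look like the following. `GetCurrentProcess` and `SetProcessAffinityMask` are real Win32 functions in kernel32, but I haven’t tested this on real hardware, and the mask 0xF targeting the P-cores as logical CPUs 0–3 is an assumption:

```julia
# Hypothetical sketch: pin the current Julia process to selected logical
# CPUs on Windows via the Win32 API. Returns false on non-Windows.
# ASSUMPTION: mask 0xF covers the four P-cores; verify the layout first.
function pin_process(mask::UInt)
    Sys.iswindows() || return false
    hproc = ccall((:GetCurrentProcess, "kernel32"), stdcall, Ptr{Cvoid}, ())
    ok = ccall((:SetProcessAffinityMask, "kernel32"), stdcall, Cint,
               (Ptr{Cvoid}, UInt), hproc, mask)
    return ok != 0
end

pin_process(UInt(0xF))
```

Note that the affinity is inherited by child processes, so pinning from inside Julia itself (rather than pinning VSCode) avoids the inheritance problem discussed above.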

1 Like

Did you try:

start /affinity F julia -t 4

from the VSCode terminal?
(If you have 4 fast cores; otherwise adjust the mask.)

Instead of using affinity or a higher priority for Julia, should you maybe lower the priority of VSCode, since this is about interference from it (or the OS thinking it more important, or as important)? The only possible problem with that I could see is if Julia inherited the same lower priority, since it is run from within VSCode, so this is only better if that can be avoided. If that’s a good fix, then maybe VSCode could be changed to voluntarily lower its priority a bit (while avoiding doing so for the Julia process it runs)…? Could someone also check with or without such a fix elsewhere, e.g. on Linux? I assume the performance was never a problem in the plain REPL (on Windows), without VSCode?

I see F there is a bitmask for the first 4 cores (those are likely the performance cores, with higher numbers being efficiency cores for sure, or are they interleaved?). You might also try to pin VSCode to the efficiency cores only (I don’t know which numbers they have, so look that up or experiment with values like 1, 2, 8, 16, …). Again, this might not work if Julia inherits that affinity as a sub-process (likely):

Upon further digging utilizing HWiNFO64, it would appear that indeed CPU0-CPU7 are the P-cores and not hyperthreading but that doesn’t excuse the idea that “the rest will be your E-cores”. 8 of those remaining 16 are hyperthreaded cores.

I’m not sure whether the info above is accurate; I’m not on Windows. I suppose you should use that tool to find the E-cores (or, where appropriate, the P-cores) and calculate the bitmask you want:
https://www.hwinfo.com/download/

This is a real nuisance. I have an Intel i7-13850HX running Windows 11. Our application needs to keep up with a live data stream, which it just about does using Threads.@threads for critical loops, but only if Julia uses exactly 8 threads (the processor has 8 performance cores). Updating from Julia 1.11 to 1.12 slowed everything down until I finally realized I had to launch Julia with --threads 7,1 instead of --threads 8.

On the whole, Windows seems to do a reasonable job of keeping the processing on the performance cores, even when launching Julia from VSCode. Using start /affinity is slightly better, but you have to skip every other CPU, as the performance cores have hyperthreading and each counts as two logical CPUs. For mine, 0x5555 seems to work, but I don’t usually bother as it’s not normally much better. Maybe AMD processors behave differently.
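For what it’s worth, the 0x5555 value can be derived rather than memorized; a minimal sketch, assuming each P-core exposes two adjacent logical CPUs (SMT siblings 0/1, 2/3, …):

```julia
# Sketch: mask selecting every other logical CPU, one per hyperthreaded
# P-core (assumes SMT siblings are adjacent pairs: 0/1, 2/3, ...).
smt_mask(ncores) = foldl((m, i) -> m | (UInt(1) << (2i)), 0:ncores-1; init = UInt(0))

println(string(smt_mask(8), base = 16))   # → "5555" for 8 P-cores
```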

2 Likes