Performance problems with the Windows scheduler and multithreading on mixed-core CPUs

I have a function that runs a long simulation (> 20 min) and used Threads.@spawn to start four of them in parallel. Disappointingly, the runtime more than doubled compared to a single-instance run. After some digging I found out that the Windows scheduler had placed all four compute threads on the four small Zen 5c cores of my laptop while the four big cores were idle…
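For reference, the pattern described above can be sketched like this, with a hypothetical `simulate` standing in for the real long-running function:

```julia
using Base.Threads

# Hypothetical stand-in for the long-running simulation.
function simulate(n)
    s = 0.0
    for i in 1:n
        s += sin(i)
    end
    return s
end

# Start four simulations in parallel (run with `julia -t 4`).
tasks = [Threads.@spawn simulate(10^6) for _ in 1:4]
results = fetch.(tasks)
```

With fewer than four threads the four tasks are simply multiplexed onto the available threads, so the parallel speedup (and the scheduler behavior discussed here) only shows up when Julia is started with enough threads.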

If I manually set the priority of the Julia process to High in the Task Manager, the threads are migrated over to the big cores, and the runtime is only about 10% longer when four are running at the same time. (Note: all tests were done with the power supply connected and the Windows power setting at maximum performance.)

So the question: is there a way to start Julia with higher priority, especially when starting it from VS Code? Or is there a command-line option for Julia?

This is on a Dell laptop with a Ryzen 7 AI 350 CPU (4 big cores / 4 small cores). I don’t know whether Intel CPUs with mixed performance/efficiency cores show the same effect.

Hope someone has an idea how to solve this permanently without manual intervention via the Task Manager.

1 Like

It’s somewhat surprising to me that the Zen 5c cores were so much slower. Other than clock speed and L2 cache size, they are identical to the big Zen 5 cores.

1 Like

But the max clock is 5 GHz vs. 3.3 GHz, and the L3 cache is only half the size. Maybe that already accounts for a factor of two for my simulations, which work on relatively large arrays.

2 Likes

Yeah, that’s very possible. I’d be really interested to see whether the Linux scheduler does a better job, but I don’t think there’s an easy way to launch Julia with higher priority.

You can use ThreadPinning.jl (https://github.com/carstenbauer/ThreadPinning.jl) to readily pin Julia threads to CPU threads on Linux, but not on Windows. I am not sure whether this is a limitation of the OS or of the package.

Workarounds:

  • Manual Affinity with Task Manager: After launching Julia, go to Windows Task Manager, locate julia.exe, right-click, choose “Set affinity”, and manually select the performance cores (consult your CPU documentation for which logical CPUs match P-cores).
  • The start command in Windows can be used with an affinity mask, but you must know the mapping of logical processors to physical P-cores and E-cores for your CPU.

Run on logical CPUs 0, 1, 2, 3 (assuming those are the four P-cores):

start /affinity F julia -t 4

Something like this might work, but I did not test it; I do not have Windows.
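For the priority part of the question, `start` also takes a priority switch, e.g. `start /high /affinity F julia -t 4`. Alternatively, here is a hedged, untested sketch of raising the priority class from within Julia itself on Windows via the Win32 API (`GetCurrentProcess`/`SetPriorityClass`; the function name `raise_priority` is mine):

```julia
# Sketch only: raise the current process's priority class on Windows.
# HIGH_PRIORITY_CLASS is the documented Win32 constant 0x80.
const HIGH_PRIORITY_CLASS = UInt32(0x00000080)

function raise_priority()
    Sys.iswindows() || return false  # no-op on other platforms
    # Pseudo-handle for the current process (does not need closing).
    hproc = ccall((:GetCurrentProcess, "kernel32"), stdcall, Ptr{Cvoid}, ())
    ok = ccall((:SetPriorityClass, "kernel32"), stdcall, Cint,
               (Ptr{Cvoid}, UInt32), hproc, HIGH_PRIORITY_CLASS)
    return ok != 0
end
```

If something like this works, calling it from `startup.jl` (or at the top of the script) would avoid the Task Manager step, including when Julia is launched from VS Code.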

1 Like

It seems like an OS bug, or at least something the OS should do better. I’m not saying it can know you want the faster cores, but shouldn’t it migrate to them when it decides that would help?!

You can enforce that (hypothetically, or on Linux), but is that the way forward? If every program did that for its users, or Julia did it for them, then nobody would use the slower efficiency cores. Is this best left up to the programmer?

I see mobile already has three levels of cores, and I’ve thought about bringing this up here (or in Offtopic), since it will eventually be relevant for Julia, but I didn’t think it was relevant yet (I’m apparently behind on tech) for desktop/laptop, not even two levels like big.LITTLE. Maybe Julia needs redesigning and some support for such two- or more-level core hierarchies? Does it have any, or e.g. OpenBLAS, or MKL?

[This is independent of language, so what are other languages doing, e.g. on mobile, which had this first? Is there special support in Android and/or iOS? By now macOS?]

This has, I think, nothing directly to do with the big cores, only indirectly. High priority means just that: the threads get scheduled more often, and I’m guessing some performance heuristics kick in. Could the code have some locking issues? If you have CPU-bound code, on one core or many, the OS should figure it out. If the program isn’t running constantly, e.g. it is doing I/O, that likely prevents the OS from figuring out that it’s CPU-demanding?! Are you running Julia 1.12.0 (which has apparently just been released), i.e. with the default one interactive thread?
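One quick way to check what thread configuration the process actually got is the standard `Base.Threads` API:

```julia
using Base.Threads

# With `julia -t 4` the :default pool has 4 threads; on recent Julia
# versions there may additionally be an :interactive pool.
println("default pool:     ", Threads.nthreads(:default))
println("interactive pool: ", Threads.nthreads(:interactive))
```

If the default pool reports fewer threads than expected, the slowdown may have nothing to do with core placement at all.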

I found lots of interesting related papers (not just these, and an MSc thesis):
https://dl.acm.org/doi/pdf/10.1109/SBCCI62366.2024.10703981

irregular microarchitecture challenges the programmer to fully explore the parallelism potential of many parallel applications. In this paper, we propose Mímir, a library for automatically finding, at runtime, the ideal number of threads for each parallel region of OpenMP applications executing on AMPs. It is transparent to the end-user, requiring no changes in the source code or recompilation. Our experiments, considering eleven parallel applications executed on an Intel Alder Lake, report that Mímir can reduce, on average, 74.79%, 72.37%, and 68.89% of the Energy-Delay Product of the applications, respectively, considering all cores, only P-Cores, and only E-Cores.

https://arxiv.org/pdf/1702.04028

It’s a limitation of the package. I don’t have good access to Windows machines and frankly don’t care too much about Windows. It could be supported, though.

1 Like