This is the platform: an Apple Silicon Mac with 16 performance cores and 8 efficiency cores. I run a simulation which repeatedly solves a sparse system of complex linear equations. It can solve a ~60k-equation system per frequency in around 4–5 seconds. If I start two such Julia processes simultaneously, the solution time per frequency increases to minutes. I do not request multiple threads explicitly, but I do notice that both a single process and two processes run on multiple threads. Memory is sufficient: when two processes are running, 154 GB is free on top of everything else that is running at that time.
Does anyone have an idea why the system would be totally thrashed when two Julia processes are started?
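For concreteness, the workload is roughly of this shape (a minimal sketch with a toy matrix; the actual simulation code is not shown in the thread):

```julia
using LinearAlgebra, SparseArrays

n = 60_000   # system size, as described above
# toy sparse complex system standing in for the real frequency-domain matrix
A = spdiagm(-1 => fill(-1.0 + 0im, n - 1),
             0 => fill(4.0 + 1im, n),
             1 => fill(-1.0 + 0im, n - 1))
b = rand(ComplexF64, n)

for ω in (1.0, 2.0, 3.0)      # one solve per frequency
    x = (A + ω * I) \ b       # sparse LU solve (UMFPACK) at this frequency
end
```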
But this is not like the situation where limiting threads is useful to prevent the libraries from clashing when running in a single process. These library accesses happen in two separate processes. Just saying…
Sounds like CPU oversubscription. You are running fine with one process, double it to two and everything slows down, and it's not a memory/swapping issue. So look for what is using too many threads (not too many when you have one process, but too many when you have two) and fix that. BLAS is a likely candidate, since you mention you are not multithreading explicitly.
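You can check and cap the OpenBLAS thread count from within Julia with the standard `LinearAlgebra.BLAS` API:

```julia
using LinearAlgebra

BLAS.get_num_threads()   # how many threads the BLAS backend is using now
BLAS.set_num_threads(8)  # cap it, e.g. so two processes can share 16 P-cores
```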
It does look like oversubscription. When I do not control BLAS threads, both processes use 12 cores, and the solution takes several minutes per frequency. When I limit both of the processes to just 8 cores (the machine has 16 performance cores, plus 8 efficiency cores), things slow down only marginally (to ~5 seconds), which is certainly within the limits of acceptability.
It is interesting how much of an effect this has!
Next thing: try it out on one of the Linux machines.
There was a big thread on Zulip about this a while ago. My fuzzy recollection is that Apple M-series chips don't generally use their CPU cores to do BLAS math; the chip has a separate matrix-math coprocessor specifically for those tasks. It's quite fast, but there are fewer coprocessors than there are CPU cores, and AppleAccelerate / LBT ignores BLAS.set_num_threads.
Likely what is happening is that two separate Julia processes both hammering the coprocessors causes them to be oversubscribed in a dumb way that kills performance.
Only if you load AppleAccelerate.jl. Since BLAS.set_num_threads has an effect here, I assume @PetrKryslUCSD isn’t doing that, as Accelerate completely ignores that parameter.
I think this processor should have at least 2 matrix coprocessors, so it should be able to handle two processes doing heavy Accelerate linalg simultaneously. Would be interesting to hear how the numbers change if using AppleAccelerate is added into the mix. (It’s typically extremely fast for forward operations, i.e., matmuls, while LAPACK solvers can be hit-or-miss. Since it sounds like this is a sparse matrix problem, that’s probably OK as I’m guessing the BLAS backend is mainly used for the matmul subroutine, not for the LAPACK solvers.)
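For reference, switching the libblastrampoline backend over to Accelerate is just a matter of loading the package (assuming AppleAccelerate.jl is installed):

```julia
using LinearAlgebra
using AppleAccelerate   # forwards BLAS/LAPACK calls to Apple Accelerate via LBT

BLAS.get_config()       # should now list Accelerate among the loaded backends
```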
A check along these lines (a sketch; the exact snippet from this post did not survive) gives true or false for you. It's likely false unless you intentionally did using AppleAccelerate, but it's possible a transitive dependency brought it in:
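```julia
# true if AppleAccelerate has been loaded into this session
any(pkg -> pkg.name == "AppleAccelerate", keys(Base.loaded_modules))
```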
Does anything change if you set the environment variable VECLIB_MAXIMUM_THREADS=1 for these processes? What about VECLIB_MAXIMUM_THREADS=2?
I think your CPU has 4 AMX coprocessors, one for each of the 4 banks of 4 P-cores. That env var could help your processes share those resources better. It’s the equivalent of BLAS.set_num_threads for Apple Accelerate (sadly, there’s no API to set this from within the process that BLAS.set_num_threads could hook up to).
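Since there is no runtime API, the variable has to be in the environment before Julia starts. A sketch of launching a worker that way from Julia (`solve.jl` is a hypothetical placeholder for the actual script):

```julia
# equivalent to the shell invocation: VECLIB_MAXIMUM_THREADS=1 julia solve.jl
cmd = addenv(`julia solve.jl`, "VECLIB_MAXIMUM_THREADS" => "1")
run(cmd)
```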
Or did you also see a slowdown when only running a single process?
Oh, well, then I guess you have enough cores that OpenBLAS running on the regular CPU cores beats Accelerate running on AMX. That's definitely not the case on my M4 Pro, but I have both fewer regular cores and a newer AMX design, so I guess that makes sense.
Another explanation could be that your code actually uses dense factorizations/linsolves/eigensolves as subroutines in the sparse computations, in which case it’s much less surprising that OpenBLAS is faster. Accelerate is first and foremost a matmul beast.
Is OpenBLAS still faster when you set BLAS.set_num_threads(8), which is what you needed to scale to two processes? What about BLAS.set_num_threads(4), which I presume is needed to scale to 4? Accelerate with VECLIB_MAXIMUM_THREADS=1 should in principle scale linearly up to 4 simultaneous processes.
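A quick way to compare would be something along these lines (a sketch; substitute the real system for the toy stand-ins `A` and `b`, which will show the thread-count effect much better):

```julia
using LinearAlgebra, SparseArrays

n = 60_000
A = spdiagm(-1 => fill(-1.0 + 0im, n - 1),
             0 => fill(4.0 + 1im, n),
             1 => fill(-1.0 + 0im, n - 1))
b = rand(ComplexF64, n)

A \ b  # warm-up solve so compilation doesn't pollute the timings

for nt in (12, 8, 4)
    BLAS.set_num_threads(nt)
    @time A \ b   # wall time per solve at each BLAS thread count
end
```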