This is the platform: an Apple Silicon Mac with 16 performance cores and 8 efficiency cores. I run a simulation which repeatedly solves a sparse system of complex linear equations. It can solve a ~60k-equation system per frequency in around 4–5 seconds. If I start two such Julia processes simultaneously, the solution time per frequency increases to minutes. I do not request multiple threads explicitly, but I do notice that both a single process and two processes run on multiple threads. Memory is sufficient: when two processes are running, 154 GB is free on top of everything else that is running at that time.
Does anyone have an idea why the system would be totally thrashed when two Julia processes are started?
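For concreteness, the workload is roughly of this shape (a minimal sketch with a toy matrix; the actual simulation code is not shown in the thread):

```julia
using LinearAlgebra, SparseArrays

n = 60_000   # system size, as described above
# toy sparse complex system standing in for the real frequency-domain matrix
A = spdiagm(-1 => fill(-1.0 + 0im, n - 1),
             0 => fill(4.0 + 1im, n),
             1 => fill(-1.0 + 0im, n - 1))
b = rand(ComplexF64, n)

for ω in (1.0, 2.0, 3.0)      # one solve per frequency
    x = (A + ω * I) \ b       # sparse LU solve (UMFPACK) at this frequency
end
```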
But this is not like the situation where limiting threads is useful to prevent the libraries from clashing when running in a single process. These library accesses happen in two separate processes. Just saying…
Sounds like CPU oversubscription. You are running fine with one process, double it to two and everything slows down, and it's not a memory/swapping issue. So look for what is using too many threads (not too many when you have one process, but too many when you have two) and fix that. BLAS is a likely candidate, since you mention you are not multithreading explicitly.
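You can check and cap the OpenBLAS thread count from within Julia with the standard `LinearAlgebra.BLAS` API:

```julia
using LinearAlgebra

BLAS.get_num_threads()   # how many threads the BLAS backend is using now
BLAS.set_num_threads(8)  # cap it, e.g. so two processes can share 16 P-cores
```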
It does look like oversubscription. When I do not control BLAS threads, both processes use 12 cores, and the solution takes several minutes per frequency. When I limit both of the processes to just 8 cores (the machine has 16 performance cores, plus 8 efficiency cores), things slow down only marginally (to ~5 seconds), which is certainly within the limits of acceptability.
It is interesting how much of an effect this has!
Next thing: try it out on one of the Linux machines.
There was a big thread on Zulip about this a while ago. My fuzzy recollection is that Apple M-series chips don't generally use their CPU cores to do BLAS math; the chip has a separate matrix-math coprocessor specifically for those tasks. It's quite fast, but there are fewer coprocessors than there are CPU cores, and AppleAccelerate / LBT ignores BLAS.set_num_threads.
Likely what is happening is that two separate Julia processes both hammering the coprocessors causes them to be oversubscribed in a dumb way that kills performance.
Only if you load AppleAccelerate.jl. Since BLAS.set_num_threads has an effect here, I assume @PetrKryslUCSD isn’t doing that, as Accelerate completely ignores that parameter.
I think this processor should have at least 2 matrix coprocessors, so it should be able to handle two processes doing heavy Accelerate linalg simultaneously. Would be interesting to hear how the numbers change if using AppleAccelerate is added into the mix. (It’s typically extremely fast for forward operations, i.e., matmuls, while LAPACK solvers can be hit-or-miss. Since it sounds like this is a sparse matrix problem, that’s probably OK as I’m guessing the BLAS backend is mainly used for the matmul subroutine, not for the LAPACK solvers.)
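For reference, switching the libblastrampoline backend over to Accelerate is just a matter of loading the package (assuming AppleAccelerate.jl is installed):

```julia
using LinearAlgebra
using AppleAccelerate   # forwards BLAS/LAPACK calls to Apple Accelerate via LBT

BLAS.get_config()       # should now list Accelerate among the loaded backends
```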
A check along these lines (a sketch; the exact snippet from this post did not survive) gives true or false for you. It's likely false unless you intentionally did using AppleAccelerate, but it's possible a transitive dependency brought it in:
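```julia
# true if AppleAccelerate has been loaded into this session
any(pkg -> pkg.name == "AppleAccelerate", keys(Base.loaded_modules))
```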
Does anything change if you set the environment variable VECLIB_MAXIMUM_THREADS=1 for these processes? What about VECLIB_MAXIMUM_THREADS=2?
I think your CPU has 4 AMX coprocessors, one for each of the 4 banks of 4 P-cores. That env var could help your processes share those resources better. It’s the equivalent of BLAS.set_num_threads for Apple Accelerate (sadly, there’s no API to set this from within the process that BLAS.set_num_threads could hook up to).
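Since there is no runtime API, the variable has to be in the environment before Julia starts. A sketch of launching a worker that way from Julia (`solve.jl` is a hypothetical placeholder for the actual script):

```julia
# equivalent to the shell invocation: VECLIB_MAXIMUM_THREADS=1 julia solve.jl
cmd = addenv(`julia solve.jl`, "VECLIB_MAXIMUM_THREADS" => "1")
run(cmd)
```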
Or did you also see a slowdown when only running a single process?
Oh, well, then I guess you have enough cores that OpenBLAS running on the regular CPU cores beats Accelerate running on AMX. That's definitely not the case on my M4 Pro, but I have both fewer regular cores and a newer AMX design, so I guess that makes sense.
Another explanation could be that your code actually uses dense factorizations/linsolves/eigensolves as subroutines in the sparse computations, in which case it’s much less surprising that OpenBLAS is faster. Accelerate is first and foremost a matmul beast.
Is OpenBLAS still faster when you set BLAS.set_num_threads(8), which is what you needed to scale to two processes? What about BLAS.set_num_threads(4), which I presume is needed to scale to 4? Accelerate with VECLIB_MAXIMUM_THREADS=1 should in principle scale linearly up to 4 simultaneous processes.
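A quick way to compare would be something along these lines (a sketch; substitute the real system for the toy stand-ins `A` and `b`, which will show the thread-count effect much better):

```julia
using LinearAlgebra, SparseArrays

n = 60_000
A = spdiagm(-1 => fill(-1.0 + 0im, n - 1),
             0 => fill(4.0 + 1im, n),
             1 => fill(-1.0 + 0im, n - 1))
b = rand(ComplexF64, n)

A \ b  # warm-up solve so compilation doesn't pollute the timings

for nt in (12, 8, 4)
    BLAS.set_num_threads(nt)
    @time A \ b   # wall time per solve at each BLAS thread count
end
```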