Multithreaded code on a beefy computer runs only as fast as serial code on an M1 Mac

I tried the following:

  1. Turning on multithreading with 16 threads. I have an @floop over a nested loop with 254 elements per loop; I left the single loop alone, as per @tkf’s suggestion. Run time decreased from 590 ms to 510 ms.

  2. Turning on OpenBLAS multithreading (16 threads). Run time went up from 510 ms to 783 ms (2498 ms with 8 threads).

  3. Switching between MKL and OpenBLAS in serial and parallel. No appreciable change in serial, but MKL is 4 times slower in parallel!
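
For anyone trying to reproduce items 1 and 2, here is a minimal sketch of the setup, not my actual code. It uses `Threads.@threads` from Base in place of `@floop` so that it runs without extra packages, and `work` and `threaded_grid` are hypothetical stand-ins for the real loop body. It assumes Julia 1.6 or later, started with several threads (e.g. `julia -t 16`). Keeping BLAS at one thread while Julia threads are active is a commonly suggested configuration; I can't say it explains the timings above.

```julia
using LinearAlgebra   # stdlib: BLAS thread control
using Base.Threads    # stdlib: @threads

# Hypothetical stand-in for the real loop body: a small BLAS-backed product.
work(A, i, j) = sum(A * A) + i - j

function threaded_grid(A, n)
    out = zeros(n, n)
    # Flatten the n-by-n nested loop into a single range so that every
    # (i, j) pair can be scheduled independently across Julia threads.
    @threads for k in 1:n*n
        j, i = fldmod1(k, n)   # column j and row i recovered from k
        out[i, j] = work(A, i, j)
    end
    return out
end

# Keep BLAS serial while Julia threads are in use, so the two thread
# pools do not compete for the same cores.
BLAS.set_num_threads(1)

A = rand(8, 8)
result = threaded_grid(A, 254)
```

Timing `threaded_grid` with different `BLAS.set_num_threads` values (ideally with `@btime` from BenchmarkTools) would show whether the BLAS thread count is what changes the run time.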

When running this in parallel, the average CPU usage is low, around 200-300%.
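
A quick way to check whether the session really has the threads I think it has is to print the counts from both layers. This is only a diagnostic sketch: `thread_report` is a hypothetical helper name, and `BLAS.get_num_threads` assumes Julia 1.6 or later.

```julia
using LinearAlgebra

# Collect the thread counts that Julia and BLAS are each allowed to use.
function thread_report()
    return (julia_threads = Threads.nthreads(),
            blas_threads  = BLAS.get_num_threads(),
            cpu_threads   = Sys.CPU_THREADS)
end

report = thread_report()
println(report)

# A ratio well above 1 suggests the Julia and BLAS thread pools may be
# competing for cores; julia_threads == 1 means Julia was started serially.
ratio = report.julia_threads * report.blas_threads / report.cpu_threads
println("threads requested per hardware thread: ", ratio)
```

If `julia_threads` comes back as 1, the session was started without `-t` or `JULIA_NUM_THREADS`, which on its own would cap CPU usage regardless of the BLAS setting.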

Something funky is going on with BLAS and MKL.