Multithreaded code on a beefy computer runs just as fast as serial code on an M1 Mac

ash · January 26, 2022, 7:02pm

I tried the following:

Turning on multithreading with 16 threads. I put an @floop over a nested loop with 254 elements per loop and ignored the single loop, as per @tkf’s suggestion. Run time decreased from 590 ms to 510 ms.
Turning on OpenBLAS multithreading (16 threads). Run time went up from 510 ms to 783 ms (2498 ms with 8 threads).
Switching between MKL and OpenBLAS in serial and parallel. No appreciable change in serial, but MKL is 4 times slower in parallel!

When running this in parallel, the average CPU usage is low, around 200-300%.

Something funky is going on with BLAS and MKL.
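
For anyone who wants to poke at this, here is a minimal sketch of how I would check the thread settings and run a threaded loop with BLAS pinned to one thread. It assumes (unconfirmed) that the slowdown comes from Julia threads and BLAS threads competing for the same cores. The function `threaded_products` and the tiny 4x4 matrices are made up for illustration, and I use `Threads.@threads` from Base instead of `@floop` so that it runs without extra packages.

```julia
using LinearAlgebra

# Hypothetical stand-in for the real workload: each iteration does one
# small matrix product, which is dispatched to BLAS.
function threaded_products(mats)
    out = Vector{Float64}(undef, length(mats))
    Threads.@threads for i in eachindex(mats)
        out[i] = sum(mats[i] * mats[i])
    end
    return out
end

# Report the current configuration before changing anything.
println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", BLAS.get_num_threads())

# Assumption: nested threading is the problem, so let Julia own the
# parallelism and keep BLAS serial while the threaded loop runs.
old_blas_threads = BLAS.get_num_threads()
BLAS.set_num_threads(1)

mats = [fill(1.0, 4, 4) for _ in 1:8]
res = threaded_products(mats)

# Restore the previous BLAS setting afterwards.
BLAS.set_num_threads(old_blas_threads)
```

Timing the real loop once with `BLAS.set_num_threads(1)` and once with the default setting should show whether the two thread pools are fighting each other. If the timings barely differ, the cause is probably something else.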
