Multithreaded code on beefy computer runs just as fast as serial code on M1 Mac

Disclaimer: I didn’t read the entire thread / didn’t follow too closely.

Did you benchmark everything with BLAS.set_num_threads(1)? If not, note that you need to take care when using MKL together with multiple Julia threads (the same can be said about OpenBLAS). I find that MKL_NUM_THREADS defaults to the number of cores of the system, and since this number applies per Julia thread, you will readily have too many MKL threads running (and thus oversubscribe your cores). See Matrix multiplication is slower when multithreading in Julia - #12 by carstenbauer for more.

TLDR: I suggest you benchmark everything with BLAS.set_num_threads(1) (independent of whether you use MKL or OpenBLAS).
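
For reference, here is a minimal sketch of what such a benchmark setup could look like. The function name threaded_matmuls and the matrix sizes are made up for illustration and are not from your code; the only part that matters is calling BLAS.set_num_threads(1) before timing, and checking the result with BLAS.get_num_threads() (available in recent Julia versions).

```julia
using LinearAlgebra

# Pin BLAS (OpenBLAS by default, MKL if MKL.jl is loaded) to a single thread,
# so that the Julia threads below are the only source of parallelism.
BLAS.set_num_threads(1)

# Hypothetical workload: independent matrix products spread over Julia threads.
function threaded_matmuls(n, ntasks)
    results = Vector{Matrix{Float64}}(undef, ntasks)
    Threads.@threads for i in 1:ntasks
        A = fill(Float64(i), n, n)
        results[i] = A * A   # each product calls into BLAS (gemm)
    end
    return results
end

println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", BLAS.get_num_threads())

threaded_matmuls(64, 4)   # warm-up run so compilation is not timed
t = @elapsed threaded_matmuls(256, 2 * Threads.nthreads())
println("elapsed: ", t, " s")
```

Start Julia with e.g. julia -t 8 and compare the timing against a run where you leave out the BLAS.set_num_threads(1) line; if the latter is slower or no faster, oversubscription is a likely culprit.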