I tried the following:
- Turning on multithreading with 16 threads. I have an `@floop` over a nested loop with 254 elements per loop; I ignored the single loop as per @tkf’s suggestion. Run time decreased from 590 ms to 510 ms.
- Turning on OpenBLAS multithreading (16 threads). Run time went up from 510 ms to 783 ms (2498 ms with 8 threads).
- Switching between MKL and OpenBLAS in serial and parallel. No appreciable change in serial, but MKL is 4 times slower in parallel!!

When running this in parallel, the average CPU usage is low, around 200-300%.
Something funky is going on with BLAS and MKL.
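
For anyone who wants to poke at the same knobs, here is a minimal sketch of how the BLAS thread count can be toggled around a threaded loop. This is not my actual code: `work!`, `time_with_blas_threads`, and the matrix sizes are made up for illustration, and it uses `Threads.@threads` from Base as a stand-in for `@floop` so that it runs with only the standard library. It assumes Julia 1.7 or newer (for `BLAS.get_config`).

```julia
using LinearAlgebra

# Hypothetical stand-in workload: one small matrix product per iteration,
# so every iteration of the threaded loop calls into BLAS (gemm).
function work!(out, A, B)
    Threads.@threads for k in eachindex(out)
        out[k] = sum(A * B)
    end
    return out
end

# Time `work!` with `n` BLAS threads, restoring the previous setting afterwards.
function time_with_blas_threads(n, out, A, B)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(n)
    try
        work!(out, A, B)                  # warm-up run so compilation is not timed
        return @elapsed work!(out, A, B)
    finally
        BLAS.set_num_threads(old)
    end
end

A = rand(64, 64)
B = rand(64, 64)
out = zeros(254)                          # 254 iterations, matching the loop size above

println("Julia threads: ", Threads.nthreads())
println("BLAS backend:  ", BLAS.get_config())
for n in (1, Threads.nthreads())
    t = time_with_blas_threads(n, out, A, B)
    println("BLAS threads = ", n, ": ", round(1000t; digits=2), " ms")
end
```

Start Julia with `julia -t 16` to get 16 Julia threads. If the run with many BLAS threads is slower than the run with one, that would point at the Julia threads and the BLAS threads competing for the same cores, but this toy workload may not behave like the real one.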