Disclaimer: I didn’t read the entire thread / didn’t follow too closely.
Did you benchmark everything with BLAS.set_num_threads(1)? If not, note that you need to take care when using MKL and multiple Julia threads at the same time (the same can be said about OpenBLAS). I find that MKL_NUM_THREADS defaults to the number of cores of the system, and since this number applies per Julia thread, you will readily have too many MKL threads running (and thus oversubscribe your cores). See Matrix multiplication is slower when multithreading in Julia - #12 by carstenbauer for more.
TLDR: I suggest you benchmark everything with BLAS.set_num_threads(1) (independent of whether you use MKL or OpenBLAS).
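
For concreteness, here is a minimal sketch of what such a benchmark setup could look like. It assumes Julia was started with several threads (e.g. `julia -t 4`) and Julia 1.6 or newer for `BLAS.get_num_threads`. The matrix size `n` and the helper `threaded_mul!` are made up for illustration and are not taken from the thread.

```julia
using LinearAlgebra

# Pin BLAS (MKL or OpenBLAS) to a single thread so that the Julia threads
# below do not each spawn their own pool of BLAS threads.
BLAS.set_num_threads(1)

println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", BLAS.get_num_threads())

# Hypothetical workload: one independent matrix product per Julia thread.
n = 200
As = [rand(n, n) for _ in 1:Threads.nthreads()]
Bs = [rand(n, n) for _ in 1:Threads.nthreads()]
Cs = [zeros(n, n) for _ in 1:Threads.nthreads()]

# Illustrative helper: each loop iteration performs one BLAS call (gemm).
function threaded_mul!(Cs, As, Bs)
    Threads.@threads for i in eachindex(Cs)
        mul!(Cs[i], As[i], Bs[i])
    end
    return Cs
end

threaded_mul!(Cs, As, Bs)        # warm-up run to exclude compilation time
@time threaded_mul!(Cs, As, Bs)  # timed run
```

Re-running the same script after changing the argument of `BLAS.set_num_threads` should show whether oversubscription affects your timings on your machine.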