Multithreaded code on beefy computer runs just as fast as serial code on M1 Mac

carstenbauer · January 26, 2022, 8:06pm

Disclaimer: I didn’t read the entire thread / didn’t follow too closely.

Did you benchmark everything with BLAS.set_num_threads(1)? If not, note that you need to take care when using MKL together with multiple Julia threads (the same can be said about OpenBLAS). I find that MKL_NUM_THREADS defaults to the number of cores of the system, and since this number applies per Julia thread, you will readily end up with too many MKL threads running (and thus oversubscribe your cores). See "Matrix multiplication is slower when multithreading in Julia - #12 by carstenbauer" for more.

TLDR: I suggest you benchmark everything with BLAS.set_num_threads(1) (independent of whether you use MKL or OpenBLAS).
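
For illustration, here is a minimal sketch of what such a benchmark setup could look like. The workload (a batch of independent matrix products), the sizes n and m, and the helper name multiply_all! are all made up for this example and are not taken from the code in this thread. The only point is that BLAS is pinned to one thread before the Julia-threaded loop runs.

```julia
using LinearAlgebra

# Pin BLAS to a single thread so that the Julia threads below
# do not each start their own pool of BLAS threads.
BLAS.set_num_threads(1)

println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", BLAS.get_num_threads())

# Hypothetical workload: independent matrix products, one per loop iteration,
# distributed over the available Julia threads.
function multiply_all!(Cs, As, Bs)
    Threads.@threads for i in eachindex(Cs, As, Bs)
        mul!(Cs[i], As[i], Bs[i])  # in-place BLAS matrix multiply
    end
    return Cs
end

n, m = 200, 16  # assumed sizes, chosen only to keep the example quick
As = [rand(n, n) for _ in 1:m]
Bs = [rand(n, n) for _ in 1:m]
Cs = [zeros(n, n) for _ in 1:m]

multiply_all!(Cs, As, Bs)               # warm-up run (includes compilation)
t = @elapsed multiply_all!(Cs, As, Bs)  # timed run
println("elapsed: ", t, " s")
```

Start Julia with several threads (e.g. `julia -t 4`) to see the effect, and compare against a run where the BLAS.set_num_threads(1) line is commented out. For more careful timings, BenchmarkTools.jl is probably a better choice than @elapsed.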
