OpenBLAS vs MKL

carstenbauer · November 9, 2018, 8:19pm

I just built Julia 1.0.1 on a cluster and linked it against MKL (2019). As a simple benchmark, I compared the performance of squaring a 1000 x 1000 matrix against the Julia binaries from the website (OpenBLAS). I remember doing this for 0.6.4 at some point and finding MKL to be faster by 30% or so. However, I’m blown away by the difference I found this time:

MKL:

julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8* (2018-09-29 19:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

julia> using LinearAlgebra; LinearAlgebra.versioninfo()
BLAS: libmkl_rt
LAPACK: libmkl_rt

julia> using BenchmarkTools

julia> A = rand(1000,1000);

julia> @btime $A*$A;
  1.926 ms (2 allocations: 7.63 MiB)

OpenBLAS:

julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

julia> using LinearAlgebra; LinearAlgebra.versioninfo()
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY SkylakeX MAX_THREADS=16)
LAPACK: libopenblas64_

julia> using BenchmarkTools

julia> A = rand(1000,1000);

julia> @btime $A*$A;
  7.905 ms (2 allocations: 7.63 MiB)

julia> @btime $A*$A;
  7.124 ms (2 allocations: 7.63 MiB)

davidbp · November 9, 2018, 8:54pm

It would be nice if you could share this for matrices of different sizes as well: (10,10), (100,100), (500,500), (1000,1000), (10000,10000). Maybe MKL is very good for some sizes but not others. In any case, a 4x difference is relevant but not an insane number. Unless your code is completely dominated by matrix multiplications, it won’t be such a big deal.
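A size sweep like the one suggested here could look roughly like the sketch below. It uses only `@elapsed` from Base rather than BenchmarkTools, so the numbers will be noisier than `@btime`. The helper name `time_square` and the `reps` parameter are made up for illustration, and the (10000,10000) case is left out because it needs about 800 MB per matrix.

```julia
using LinearAlgebra

# Time A*A for several square sizes; returns a vector of (n, seconds) tuples.
# `time_square` is a hypothetical helper name, not part of any library.
function time_square(sizes; reps = 3)
    results = Tuple{Int,Float64}[]
    for n in sizes
        A = rand(n, n)
        A * A                      # warm-up run so compilation is not timed
        best = minimum(@elapsed(A * A) for _ in 1:reps)
        push!(results, (n, best))  # keep the fastest of `reps` runs
    end
    return results
end

for (n, t) in time_square((10, 100, 500, 1000))
    println("n = ", n, ":  ", round(t * 1e3, digits = 3), " ms")
end
```

Running the same script under the MKL build and the OpenBLAS build would show whether the gap is uniform or only appears at certain sizes.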

Elrod · November 9, 2018, 9:39pm

The cluster has avx-512:

SkylakeX

Also, see the Intel Ark page for the Xeon(R) Gold 6148.

The latest OpenBLAS has finally gotten some support for avx512 dgemm, but its kernels are still far from optimal. Julia 1.0.1 does not come with the latest OpenBLAS.

I see a similar difference on my Skylake-X cpu. OpenBLAS does far better on avx2 architectures, like Haswell.

carstenbauer · November 10, 2018, 1:37am

[Image: openblas_vs_mkl timings]

My code is completely dominated by matrix multiplications.

e3c6 · January 15, 2020, 7:56pm

Wow, that’s a big difference. Does the picture remain the same after one year?

Mason · January 15, 2020, 8:08pm

Most of the performance gap is likely due to the issue "BLAS threads should default to physical not logical core count?" (Issue #33409, JuliaLang/julia on GitHub).
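One way to test this on a given machine is to pin the BLAS thread count by hand and re-run the benchmark. The sketch below assumes hyperthreading with two hardware threads per physical core, which may not hold everywhere. Base does not report the physical core count here, so `physical_guess` is only an estimate derived from `Sys.CPU_THREADS`.

```julia
using LinearAlgebra

# Assumption: two hardware threads per physical core (hyperthreading on).
# This halves the logical thread count to estimate the physical core count;
# adjust `threads_per_core` for your machine.
threads_per_core = 2
physical_guess = max(1, Sys.CPU_THREADS ÷ threads_per_core)

# Tell the BLAS library to use that many threads.
BLAS.set_num_threads(physical_guess)

A = rand(1000, 1000)
A * A                         # warm-up
t = @elapsed A * A
println("BLAS threads set to ", physical_guess,
        ", A*A took ", round(t * 1e3, digits = 2), " ms")
```

Comparing this timing with the default thread setting would show how much of the gap the thread count accounts for on that machine.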

Oscar_Smith · January 15, 2020, 11:09pm

MKL might also switch over to Strassen’s method or a similar sub-O(n^3) algorithm when the matrices get big.

Elrod · January 16, 2020, 1:01am

I don’t think it does, based on LinearAlgebra.peakflops(N).
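The reasoning can be sketched as follows: a classical multiply of two N x N matrices costs about 2N^3 floating-point operations, and `peakflops` derives its rate from that count. If the library switched to a sub-cubic algorithm at large N, the apparent rate computed this way would keep climbing past what the hardware can deliver instead of levelling off. The helper `apparent_gflops` below is a made-up name for illustration.

```julia
using LinearAlgebra

# Apparent flop rate of A*A, assuming the classical 2n^3 operation count.
# `apparent_gflops` is a hypothetical helper for illustration.
function apparent_gflops(n; reps = 3)
    A = rand(n, n)
    A * A                                   # warm-up
    t = minimum(@elapsed(A * A) for _ in 1:reps)
    return 2 * Float64(n)^3 / t / 1e9
end

for n in (500, 1000, 2000)
    println("n = ", n, ":  ", round(apparent_gflops(n), digits = 1),
            " GFLOPS (apparent)")
end

# The built-in estimate (flops per second), converted to GFLOPS for comparison.
println("peakflops(1000) = ",
        round(LinearAlgebra.peakflops(1000) / 1e9, digits = 1), " GFLOPS")
```

Applied to the timings posted at the top of the thread (N = 1000, so about 2e9 operations), 1.926 ms works out to roughly 1040 GFLOPS for MKL and 7.124 ms to roughly 280 GFLOPS for OpenBLAS.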
