OpenBLAS vs MKL


#1

I just built Julia 1.0.1 on a cluster and linked it against MKL (2019). As a simple benchmark, I compared the time to square a 1000 x 1000 matrix with this build against the Julia binaries from the website (OpenBLAS). I remember doing this for 0.6.4 at some point and finding MKL to be about 30% faster. However, I'm blown away by the difference I found this time:

MKL:

julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8* (2018-09-29 19:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

julia> using LinearAlgebra; LinearAlgebra.versioninfo()
BLAS: libmkl_rt
LAPACK: libmkl_rt

julia> using BenchmarkTools

julia> A = rand(1000,1000);

julia> @btime $A*$A;
  1.926 ms (2 allocations: 7.63 MiB)

OpenBLAS:

julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

julia> using LinearAlgebra; LinearAlgebra.versioninfo()
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY SkylakeX MAX_THREADS=16)
LAPACK: libopenblas64_

julia> using BenchmarkTools

julia> A = rand(1000,1000);

julia> @btime $A*$A;
  7.905 ms (2 allocations: 7.63 MiB)

julia> @btime $A*$A;
  7.124 ms (2 allocations: 7.63 MiB)
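
To put the numbers above in perspective, here is a small sketch that turns the three timings into speedup factors and approximate GFLOP/s. It assumes the conventional count of 2n^3 floating-point operations for an n x n matrix product; the variable and function names are only illustrative.

```julia
# Timings copied from the two sessions above, converted to seconds.
t_mkl = 1.926e-3
t_openblas = [7.905e-3, 7.124e-3]

n = 1000
flops = 2 * n^3                 # assumed operation count for an n x n product

# Throughput in GFLOP/s for a given wall time in seconds.
gflops(t) = flops / t / 1e9

# How many times faster MKL is than each OpenBLAS run.
speedups = t_openblas ./ t_mkl

println("MKL:      ", round(gflops(t_mkl), digits = 1), " GFLOP/s")
for (t, s) in zip(t_openblas, speedups)
    println("OpenBLAS: ", round(gflops(t), digits = 1), " GFLOP/s, MKL faster by ",
            round(s, digits = 2), "x")
end
```

With these inputs the speedup comes out between roughly 3.7x and 4.1x, depending on which OpenBLAS run is used.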

#2

It would be nice if you could share the same benchmark for matrices of different sizes: (10,10), (100,100), (500,500), (1000,1000), (10000,10000). Maybe MKL is very good for some sizes but not for others. In any case, a 4x difference is relevant but not an insane number. Unless your code is completely dominated by matrix multiplications, it won't be such a big deal.
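
Something along these lines could be run on both builds to get that comparison. This is only a sketch: it uses `@elapsed` from Base with a best-of-several-runs loop as a stand-in for `@btime`, the helper `time_square` and its keyword arguments are made up for illustration, and the GFLOP/s column assumes 2n^3 operations per product. The 10000 x 10000 case is left out by default because a single matrix of that size already takes about 800 MB.

```julia
# Best-of-`reps` wall time in seconds for one A*A with an n x n matrix.
# Small sizes are repeated `inner` times per measurement so the timer
# resolution does not dominate (hypothetical helper, not a library function).
function time_square(n; reps = 5, inner = max(1, 10^7 ÷ n^3))
    A = rand(n, n)
    A * A                          # warm-up so compilation is not timed
    best = Inf
    for _ in 1:reps
        t = @elapsed for _ in 1:inner
            A * A
        end
        best = min(best, t / inner)
    end
    return best
end

sizes = (10, 100, 500, 1000)       # add 10000 if memory and patience allow
for n in sizes
    t = time_square(n)
    println(rpad(n, 6), round(t * 1e3, digits = 4), " ms   ",
            round(2 * n^3 / t / 1e9, digits = 1), " GFLOP/s")
end
```

For the smallest sizes, `@btime` from BenchmarkTools will give more trustworthy numbers than this loop; the point here is just to see how the ratio between the two libraries changes with size.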


#3

The cluster has avx-512:

SkylakeX

Also, see Intel Ark Xeon® Gold 6148
The latest OpenBLAS has finally gotten some support for AVX-512 dgemm, but its kernels are still far from optimal. Julia 1.0.1 does not ship with the latest OpenBLAS.

I see a similar difference on my Skylake-X CPU. OpenBLAS does far better on AVX2 architectures, like Haswell.


#4

openblas_vs_mkl timings

My code is completely dominated by matrix multiplications.