MKL slower than OpenBLAS on an Intel CPU

I found that MKL is ~1.5 times slower than OpenBLAS for matrix-matrix multiplication (both single-threaded). Since I'm using an Intel CPU, I expected MKL to be faster.

Is there anything obviously wrong with this benchmark? Or is it normal for MKL to be slower than OpenBLAS in some cases?

using LinearAlgebra, StaticArrays, BenchmarkTools
N = 400
M = 5000
A = rand(ComplexF64, N, M);
x = rand(ComplexF64, M, 3);
y = rand(ComplexF64, N, 3);
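# OpenBLAS (Julia's default BLAS backend), restricted to one thread
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 3.323 ms (0 allocations: 0 bytes)

# Loading MKL.jl swaps the BLAS backend to MKL; set the thread count again
using MKL
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 5.168 ms (0 allocations: 0 bytes)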
julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA = /home/jmlim/appl/julia-1.7.0/bin/julia
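
Before comparing timings, it may be worth confirming which BLAS library is actually active and how many threads it uses. The following is a minimal sketch, assuming Julia 1.7 or later (where `BLAS.get_config()` is available); run it once before and once after `using MKL` and compare the printed library names.

```julia
using LinearAlgebra

# List the BLAS/LAPACK libraries currently loaded through libblastrampoline.
# With the default setup this should name an OpenBLAS library; after
# `using MKL` it should name an MKL library instead.
for lib in BLAS.get_config().loaded_libs
    println(lib.libname)
end

# Confirm that the thread setting took effect for the active backend.
BLAS.set_num_threads(1)
println("BLAS threads: ", BLAS.get_num_threads())
```

If the library name does not change after loading MKL.jl, both timings would be measuring the same backend.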

I found that the performance gap is specific to that matrix shape (the right-hand side has only 3 columns). With square matrices, MKL is slightly faster:

using LinearAlgebra, StaticArrays, BenchmarkTools
N = 300
M = 300
K = 300
A = rand(ComplexF64, N, M);
x = rand(ComplexF64, M, K);
y = rand(ComplexF64, N, K);
# OpenBLAS, one thread
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 2.209 ms (0 allocations: 0 bytes)

# MKL, one thread
using MKL
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 2.015 ms (0 allocations: 0 bytes)

So for square matrices the result is as expected: MKL is slightly faster than OpenBLAS.
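
One way to compare the two shapes is to convert the timings into throughput. The sketch below uses the conventional count of 8 real floating-point operations per complex multiply-add; `gemm_flops` and `gflops` are hypothetical helper names, and the timings are the ones quoted in the two benchmarks above.

```julia
# Real floating-point operations for a complex (N×M) * (M×K) product:
# each complex multiply-add counts as 8 real flops (6 multiply + 2 add).
gemm_flops(N, M, K) = 8 * N * M * K

# Convert a timing in milliseconds into GFLOP/s.
gflops(N, M, K, t_ms) = gemm_flops(N, M, K) / (t_ms * 1e-3) / 1e9

# (OpenBLAS, MKL) timings in ms, copied from the benchmarks above.
thin   = (gflops(400, 5000, 3, 3.323), gflops(400, 5000, 3, 5.168))
square = (gflops(300, 300, 300, 2.209), gflops(300, 300, 300, 2.015))

println("thin   (OpenBLAS, MKL) GFLOP/s: ", round.(thin, digits = 1))
println("square (OpenBLAS, MKL) GFLOP/s: ", round.(square, digits = 1))
```

For both libraries the 3-column case comes out several times lower in GFLOP/s than the square case, so the two shapes are not directly comparable as benchmarks.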

  • Are you using MKL.jl v0.5.0 (released 2 days ago)?
  • Could you please test with the new Julia v1.8.0-beta1?

I can’t reproduce.

  7.244 ms (0 allocations: 0 bytes)  OpenBLAS
  4.748 ms (0 allocations: 0 bytes)  MKL (0.5.0)

This is on an 11-day-old Julia 1.8.0-DEV build with an Intel CPU:

Julia Version 1.8.0-DEV.1572
Commit 7889b2a6a2 (2022-02-16 21:17 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
  Threads: 8 on 8 virtual cores

Yes, I get basically the same timings with v1.8.0-beta1 + MKL 0.5.0.


Thanks for running the benchmark. The problem seems to be specific to my CPU.

To be clear, did you run the one in the OP, (400, 5000) * (5000, 3), or the one in the reply, (300, 300) * (300, 300)?

The results above are for the sizes in the OP, (400, 5000) × (5000, 3), and the following are the results for (300, 300) × (300, 300).

LinearAlgebra.BLAS.set_num_threads(1)
...
  3.819 ms (0 allocations: 0 bytes)  OpenBLAS
  3.617 ms (0 allocations: 0 bytes)  MKL (0.5.0)
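
To avoid ambiguity about which shape was measured, both sizes can be timed from one script. This is a rough sketch using only Base's `@elapsed` and taking the minimum over a few repetitions, so the numbers will be noisier than `@btime`; `time_mul` is a hypothetical helper, not part of any package.

```julia
using LinearAlgebra

# Time y = A * x for an (N×M) * (M×K) complex product and return the
# best (minimum) time in milliseconds over `reps` runs.
function time_mul(N, M, K; reps = 20)
    A = rand(ComplexF64, N, M)
    x = rand(ComplexF64, M, K)
    y = zeros(ComplexF64, N, K)
    mul!(y, A, x)  # warm-up run so compilation is not timed
    best = Inf
    for _ in 1:reps
        t = @elapsed mul!(y, A, x)
        best = min(best, t)
    end
    return best * 1e3
end

BLAS.set_num_threads(1)
for shape in ((400, 5000, 3), (300, 300, 300))
    println(shape, " => ", round(time_mul(shape...), digits = 3), " ms")
end
```

Running the script once in a fresh session without MKL and once with `using MKL` at the top gives one pair of numbers per shape.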