I found that MKL is roughly 1.5× slower than OpenBLAS for matrix-matrix multiplication (both single-threaded). Since I'm using an Intel CPU, I expected MKL to be faster.
Is there anything obviously wrong in this benchmark, or is it not unexpected for MKL to be slower than OpenBLAS?
using LinearAlgebra, StaticArrays, BenchmarkTools
N = 400
M = 5000
A = rand(ComplexF64, N, M);
x = rand(ComplexF64, M, 3);
y = rand(ComplexF64, N, 3);
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 3.323 ms (0 allocations: 0 bytes)
using MKL
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 5.168 ms (0 allocations: 0 bytes)
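For reference, one quick way to confirm which BLAS library is actually loaded (available on Julia 1.7+ via libblastrampoline; the exact library file names will differ by platform):

using LinearAlgebra
BLAS.get_config()   # lists the BLAS/LAPACK libraries currently loaded

using MKL
BLAS.get_config()   # should now list an MKL library (e.g. libmkl_rt) instead of libopenblas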
julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
JULIA = /home/jmlim/appl/julia-1.7.0/bin/julia
I found the slowdown is specific to that matrix shape (the second factor is very narrow, only 3 columns). With square 300×300 matrices, MKL is slightly faster:
using LinearAlgebra, StaticArrays, BenchmarkTools
N = 300
M = 300
K = 300
A = rand(ComplexF64, N, M);
x = rand(ComplexF64, M, K);
y = rand(ComplexF64, N, K);
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 2.209 ms (0 allocations: 0 bytes)
using MKL
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 2.015 ms (0 allocations: 0 bytes)
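To see where the crossover happens, a sweep over the number of right-hand-side columns could help. This is just a sketch (the K values are arbitrary); run it once with stock OpenBLAS and once after `using MKL`, in fresh sessions, and compare:

using LinearAlgebra, BenchmarkTools

LinearAlgebra.BLAS.set_num_threads(1)
N, M = 400, 5000
A = rand(ComplexF64, N, M)

for K in (1, 3, 10, 30, 100, 300)
    x = rand(ComplexF64, M, K)
    y = rand(ComplexF64, N, K)
    t = @belapsed mul!($y, $A, $x)          # minimum time in seconds
    println("K = ", K, ": ", round(t * 1e3, digits = 3), " ms")
end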
I can't reproduce this; for me MKL is faster:
7.244 ms (0 allocations: 0 bytes) OpenBLAS
4.748 ms (0 allocations: 0 bytes) MKL (0.5.0)
On an 11-day-old Julia 1.8.0-DEV build with an Intel CPU:
Julia Version 1.8.0-DEV.1572
Commit 7889b2a6a2 (2022-02-16 21:17 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
Threads: 8 on 8 virtual cores
Yes, I get basically the same timings with v1.8.0-beta1 + MKL 0.5.0.
Thanks for running the benchmark. The problem seems to be specific to my CPU.
To be clear, did you run the one in the OP ((400, 5000) * (5000, 3)) or the one in the reply ((300, 300) * (300, 300))?
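One thing I still want to check (an assumption on my part, I haven't tried it on this machine yet): MKL can report which instruction-set code path it dispatches to when the MKL_VERBOSE environment variable is set, which might show whether it is picking a poor kernel for this CPU:

# Assumption: MKL_VERBOSE is read when MKL initializes, so set it before MKL is
# loaded (setting it in the shell before launching Julia also works).
ENV["MKL_VERBOSE"] = "1"

using MKL, LinearAlgebra

A = rand(ComplexF64, 400, 5000)
x = rand(ComplexF64, 5000, 3)
y = rand(ComplexF64, 400, 3)

mul!(y, A, x)   # MKL prints a verbose line for the ZGEMM call, including the detected ISA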
The results above are for the sizes in the OP, (400, 5000) × (5000, 3), and the following are the results for (300, 300) × (300, 300).
LinearAlgebra.BLAS.set_num_threads(1)
...
3.819 ms (0 allocations: 0 bytes) OpenBLAS
3.617 ms (0 allocations: 0 bytes) MKL (0.5.0)