I found that MKL is roughly 1.5× slower than OpenBLAS for matrix-matrix multiplication (both single-threaded). Since I'm using an Intel CPU, I expected MKL to be faster.
Is there anything obviously wrong in this benchmark, or is it not unexpected for MKL to be slower than OpenBLAS?
using LinearAlgebra, StaticArrays, BenchmarkTools
N = 400
M = 5000
A = rand(ComplexF64, N, M);
x = rand(ComplexF64, M, 3);
y = rand(ComplexF64, N, 3);
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 3.323 ms (0 allocations: 0 bytes)
using MKL
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 5.168 ms (0 allocations: 0 bytes)
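For reference, one quick way to confirm which BLAS library is actually loaded (available on Julia 1.7+ via libblastrampoline; the exact library file names will differ by platform):

using LinearAlgebra
BLAS.get_config()   # lists the BLAS/LAPACK libraries currently loaded

using MKL
BLAS.get_config()   # should now list an MKL library (e.g. libmkl_rt) instead of libopenblas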
julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
JULIA = /home/jmlim/appl/julia-1.7.0/bin/julia
I found the slowdown is specific to that matrix shape (the second factor is very narrow, only 3 columns). With square 300×300 matrices, MKL is slightly faster:
using LinearAlgebra, StaticArrays, BenchmarkTools
N = 300
M = 300
K = 300
A = rand(ComplexF64, N, M);
x = rand(ComplexF64, M, K);
y = rand(ComplexF64, N, K);
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 2.209 ms (0 allocations: 0 bytes)
using MKL
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($y, $A, $x);
# 2.015 ms (0 allocations: 0 bytes)
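To see where the crossover happens, a sweep over the number of right-hand-side columns could help. This is just a sketch (the K values are arbitrary); run it once with stock OpenBLAS and once after `using MKL`, in fresh sessions, and compare:

using LinearAlgebra, BenchmarkTools

LinearAlgebra.BLAS.set_num_threads(1)
N, M = 400, 5000
A = rand(ComplexF64, N, M)

for K in (1, 3, 10, 30, 100, 300)
    x = rand(ComplexF64, M, K)
    y = rand(ComplexF64, N, K)
    t = @belapsed mul!($y, $A, $x)          # minimum time in seconds
    println("K = ", K, ": ", round(t * 1e3, digits = 3), " ms")
end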
I can't reproduce this; for me MKL is faster:
7.244 ms (0 allocations: 0 bytes) OpenBLAS
4.748 ms (0 allocations: 0 bytes) MKL (0.5.0)
On an 11-day-old Julia 1.8.0-DEV build with an Intel CPU:
Julia Version 1.8.0-DEV.1572
Commit 7889b2a6a2 (2022-02-16 21:17 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
Threads: 8 on 8 virtual cores
Yes, I get basically the same timings with v1.8.0-beta1 + MKL 0.5.0.
Thanks for running the benchmark. The problem seems to be specific to my CPU.
To be clear, did you run the one in the OP ((400, 5000) * (5000, 3)) or the one in the reply ((300, 300) * (300, 300))?
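One thing I still want to check (an assumption on my part, I haven't tried it on this machine yet): MKL can report which instruction-set code path it dispatches to when the MKL_VERBOSE environment variable is set, which might show whether it is picking a poor kernel for this CPU:

# Assumption: MKL_VERBOSE is read when MKL initializes, so set it before MKL is
# loaded (setting it in the shell before launching Julia also works).
ENV["MKL_VERBOSE"] = "1"

using MKL, LinearAlgebra

A = rand(ComplexF64, 400, 5000)
x = rand(ComplexF64, 5000, 3)
y = rand(ComplexF64, 400, 3)

mul!(y, A, x)   # MKL prints a verbose line for the ZGEMM call, including the detected ISA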
The results above are for the sizes in the OP, (400, 5000) × (5000, 3), and the following are the results for (300, 300) × (300, 300).
LinearAlgebra.BLAS.set_num_threads(1)
...
3.819 ms (0 allocations: 0 bytes) OpenBLAS
3.617 ms (0 allocations: 0 bytes) MKL (0.5.0)