Is MKL performance on AMD no longer crippled?

meisel · May 9, 2024, 9:12pm

I recently tried MKL 2024.1 on an AMD cloud instance (1 vCPU on top of a 4th gen AMD EPYC CPU), and found it to have very reasonable performance. In fact, it was better than another matmul library I tried. Has Intel finally made MKL reasonably fast on AMD processors without needing hacks like MKL_DEBUG_CPU_TYPE=5? It would save me a lot of trouble if so.

Oscar_Smith · May 9, 2024, 9:57pm

I believe its single-threaded performance is normal, but if you are using multiple threads, it will do intentionally dumb stuff to use all the threads even if the matrix is way too small for that to be a good idea.

Elrod · May 10, 2024, 3:30am

I don’t think this is true.
The open source, Wilkinson-Prize-winning, AMD-sponsored BLIS project does this, but I have not seen MKL do it.

julia> using BLASBenchmarksCPU

julia> BLASBenchmarksCPU.blis_set_num_threads(32); BLASBenchmarksCPU.mkl_set_num_threads(32); BLASBenchmarksCPU.openblas_set_num_threads(32);

julia> M = K = N = 15; T=Float64; A = rand(T,M,K); B = rand(T,K,N); C = Matrix{T}(undef,M,N);

julia> @btime BLASBenchmarksCPU.gemmblis!($C, $A, $B);
  11.590 μs (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmmkl!($C, $A, $B);
  978.333 ns (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmopenblas!($C, $A, $B);
  594.464 ns (0 allocations: 0 bytes)

julia> BLASBenchmarksCPU.blis_set_num_threads(1); BLASBenchmarksCPU.mkl_set_num_threads(1); BLASBenchmarksCPU.openblas_set_num_threads(1);

julia> @btime BLASBenchmarksCPU.gemmblis!($C, $A, $B);
  1.053 μs (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmmkl!($C, $A, $B);
  975.000 ns (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmopenblas!($C, $A, $B);
  586.883 ns (0 allocations: 0 bytes)


julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
  LD_UN_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  LD_LIBRARY_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  JULIA_PATH = @.
  JULIA_NUM_THREADS = 64

OpenBLAS did outperform MKL though, which I don’t think is likely to happen on an Intel system.

BLASBenchmarksCPU is a little heavy on the dependencies.
I just use it to compare all three libraries side by side. If I used it more often, I’d move the library-wrapping to a separate repo. Anyone else is more than welcome to do that. It’d mostly be a copy/paste job (or, mostly cp a file).
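For anyone who wants to repeat the small-matrix threading check without those dependencies, here is a minimal sketch that uses only the LinearAlgebra standard library. It times whichever BLAS is currently loaded (OpenBLAS by default; loading MKL.jl first would swap in MKL), so it tests one library at a time rather than all three side by side. The helper `best_time` and the `results` dictionary are names made up for this illustration, and `@elapsed` is a cruder timer than `@btime`, so treat the numbers as rough.

```julia
using LinearAlgebra

# Hypothetical helper: run gemm! `reps` times and return the fastest time in seconds.
function best_time(C, A, B; reps = 1_000)
    BLAS.gemm!('N', 'N', 1.0, A, B, 0.0, C)  # warm-up call so compilation is not timed
    best = Inf
    for _ in 1:reps
        t = @elapsed BLAS.gemm!('N', 'N', 1.0, A, B, 0.0, C)
        best = min(best, t)
    end
    return best
end

M = K = N = 15
A = rand(M, K); B = rand(K, N); C = Matrix{Float64}(undef, M, N)

old_threads = BLAS.get_num_threads()   # remember the current setting
results = Dict{Int,Float64}()          # thread count => best time in seconds
for nt in (1, old_threads)
    BLAS.set_num_threads(nt)
    results[nt] = best_time(C, A, B)
end
BLAS.set_num_threads(old_threads)      # restore the original setting

for (nt, t) in sort(collect(results))
    println(nt, " BLAS thread(s): ", round(t * 1e9; digits = 1), " ns")
end
```

If the multithreaded time is far larger than the single-threaded one at this size, the library is probably spreading a tiny problem across threads, which is the pattern BLIS shows above.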

meisel · May 10, 2024, 7:22pm

What are you seeing for larger matrix multiplications on a single thread? E.g., 5,000x100 by 100x5,000. This is a more relevant use case for me.

Elrod · May 11, 2024, 8:34am

julia> using BLASBenchmarksCPU

julia> M, K, N = 5_000, 100, 5_000; T=Float64; A = rand(T,M,K); B = rand(T,K,N); C = Matrix{T}(undef,M,N);

julia> BLASBenchmarksCPU.blis_set_num_threads(1); BLASBenchmarksCPU.mkl_set_num_threads(1); BLASBenchmarksCPU.openblas_set_num_threads(1);

julia> @btime BLASBenchmarksCPU.gemmblis!($C, $A, $B);
  94.387 ms (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmmkl!($C, $A, $B);
  113.861 ms (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmopenblas!($C, $A, $B);
  113.873 ms (0 allocations: 0 bytes)

BLIS wins here.

BLIS loses at small sizes because its developers don’t care about them at all, and thus I don’t think it has any fast paths to avoid some of the overheads.
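To put the single-threaded 5,000x100 by 100x5,000 timings above in more familiar units, here is a rough conversion to GFLOPS. It assumes the usual GEMM operation count of 2*M*K*N (one multiply and one add per term); the function name `gflops` is made up for this sketch.

```julia
# GEMM does about 2*M*K*N floating-point operations (one multiply and one add per term).
gflops(M, K, N, seconds) = 2 * M * K * N / seconds / 1e9

# Single-threaded timings reported above, in milliseconds.
times_ms = ["BLIS" => 94.387, "MKL" => 113.861, "OpenBLAS" => 113.873]

rates = [name => gflops(5_000, 100, 5_000, ms / 1e3) for (name, ms) in times_ms]

for (name, r) in rates
    println(rpad(name, 9), round(r; digits = 1), " GFLOPS")
end
```

That works out to about 53 GFLOPS for BLIS and about 44 GFLOPS for both MKL and OpenBLAS, so BLIS has roughly 20% higher throughput at this size.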
