Is MKL performance on AMD no longer crippled?

I recently tried MKL 2024.1 on an AMD cloud instance (1 vCPU on top of a 4th gen AMD EPYC CPU), and found it to have very reasonable performance. In fact, it was better than another matmul library I tried. Has Intel finally made MKL reasonably fast on AMD processors without needing hacks like MKL_DEBUG_CPU_TYPE=5? It would save me a lot of trouble if so.

I believe its single-threaded performance is normal, but if you are using multiple threads, it will do intentionally dumb stuff to use all the threads even if the matrix is way too small for that to be a good idea.

I don’t think this is true.
The open-source, Wilkinson-Prize-winning, AMD-sponsored BLIS project does this, but I have not seen MKL do it.

julia> using BLASBenchmarksCPU

julia> BLASBenchmarksCPU.blis_set_num_threads(32); BLASBenchmarksCPU.mkl_set_num_threads(32); BLASBenchmarksCPU.openblas_set_num_threads(32);

julia> M = K = N = 15; T=Float64; A = rand(T,M,K); B = rand(T,K,N); C = Matrix{T}(undef,M,N);

julia> @btime BLASBenchmarksCPU.gemmblis!($C, $A, $B);
  11.590 μs (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmmkl!($C, $A, $B);
  978.333 ns (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmopenblas!($C, $A, $B);
  594.464 ns (0 allocations: 0 bytes)

julia> BLASBenchmarksCPU.blis_set_num_threads(1); BLASBenchmarksCPU.mkl_set_num_threads(1); BLASBenchmarksCPU.openblas_set_num_threads(1);

julia> @btime BLASBenchmarksCPU.gemmblis!($C, $A, $B);
  1.053 μs (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmmkl!($C, $A, $B);
  975.000 ns (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmopenblas!($C, $A, $B);
  586.883 ns (0 allocations: 0 bytes)


julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
  LD_UN_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  LD_LIBRARY_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  JULIA_PATH = @.
  JULIA_NUM_THREADS = 64

OpenBLAS did outperform MKL though, which I don’t think is likely to happen on an Intel system.

BLASBenchmarksCPU is a little heavy on the dependencies.
I just use it to compare all three libraries side by side. If I used it more often, I’d move the library wrappers to a separate repo. Anyone else is more than welcome to do that; it would mostly be a copy/paste job (largely just copying one file).
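For a dependency-free sanity check, a rough sketch using only the LinearAlgebra standard library is shown here. It times only whichever BLAS Julia is currently bound to (OpenBLAS by default), so it is not a side-by-side comparison like the one above. The helper name `time_gemm` and its keyword arguments are made up for illustration, and `@elapsed` is much coarser than `@btime`, so treat the numbers as ballpark only.

```julia
using LinearAlgebra

# Hypothetical helper: best-of-`reps` wall time (seconds) for C = A*B
# using the BLAS that LinearAlgebra is currently bound to.
function time_gemm(M, K, N; T = Float64, nthreads = 1, reps = 5)
    A = rand(T, M, K); B = rand(T, K, N); C = Matrix{T}(undef, M, N)
    BLAS.set_num_threads(nthreads)     # note: changes the global BLAS thread count
    mul!(C, A, B)                      # warm-up call so compilation is not timed
    return minimum(@elapsed(mul!(C, A, B)) for _ in 1:reps)
end

println(BLAS.get_config())             # which BLAS library is loaded (Julia 1.7+)
t = time_gemm(1_000, 100, 1_000)
println("1000x100 by 100x1000, 1 thread: ", round(t * 1e3; digits = 3), " ms")
```

For very small matrices such as 15x15, `@elapsed` is too coarse to be meaningful; BenchmarkTools' `@btime`, as used in the transcripts here, is the better tool.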


What are you seeing for larger matrix multiplications on a single thread? E.g., 5,000x100 by 100x5,000. This is a more relevant use case for me.

julia> using BLASBenchmarksCPU

julia> M, K, N = 5_000, 100, 5_000; T=Float64; A = rand(T,M,K); B = rand(T,K,N); C = Matrix{T}(undef,M,N);

julia> BLASBenchmarksCPU.blis_set_num_threads(1); BLASBenchmarksCPU.mkl_set_num_threads(1); BLASBenchmarksCPU.openblas_set_num_threads(1);

julia> @btime BLASBenchmarksCPU.gemmblis!($C, $A, $B);
  94.387 ms (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmmkl!($C, $A, $B);
  113.861 ms (0 allocations: 0 bytes)

julia> @btime BLASBenchmarksCPU.gemmopenblas!($C, $A, $B);
  113.873 ms (0 allocations: 0 bytes)

BLIS wins here.
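To put the timings above on a common scale, they can be converted to GFLOPS. This is a small sketch assuming the usual 2*M*K*N operation count for a dense matrix multiply; the times are copied from the `@btime` output above.

```julia
# A dense M×K by K×N multiply performs roughly 2*M*K*N floating-point
# operations (one multiply and one add per inner-product term).
M, K, N = 5_000, 100, 5_000
flops = 2 * M * K * N

# Single-thread times in seconds, copied from the @btime output in this thread.
times = Dict("BLIS" => 94.387e-3, "MKL" => 113.861e-3, "OpenBLAS" => 113.873e-3)

gflops = Dict(name => flops / t / 1e9 for (name, t) in times)

# Print fastest first.
for (name, g) in sort(collect(gflops); by = last, rev = true)
    println(rpad(name, 9), round(g; digits = 1), " GFLOPS")
end
```

That works out to about 53 GFLOPS for BLIS versus about 44 GFLOPS for both MKL and OpenBLAS on a single core of this machine.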

BLIS loses at small sizes because its developers don’t care about them at all, and thus I don’t think it has any fast paths to avoid some of the overheads.
