OpenBLAS is faster than Intel MKL on AMD Hardware (Ryzen)

The interesting question, in my opinion, is whether AMD Ryzen + OpenBLAS can compete against Intel + Intel MKL for a similar budget.

I just tested on a 7900X, which costs about as much as the Ryzen Threadripper 1950X and was released at around the same time.
The major differences are 10 cores / 20 threads (7900X) vs 16 / 32 (1950X), and two 512-bit FMA units vs two 128-bit units per core.
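To put those core and FMA numbers side by side, here is a rough back-of-the-envelope sketch of peak double-precision flops per clock cycle. It assumes every FMA unit can retire one full-width FMA per cycle and ignores clock speeds (including any AVX-512 downclocking) and memory bandwidth, so treat it as an upper bound on the ratio rather than a prediction. The helper name `peak_flops_per_cycle` is just something I made up for illustration.

```julia
# Peak Float64 flops per cycle for the whole chip:
# cores * FMA units per core * Float64 lanes per unit * 2
# (each fused multiply-add counts as one multiply plus one add).
peak_flops_per_cycle(cores, fma_units, vector_bits) =
    cores * fma_units * (vector_bits ÷ 64) * 2

intel_7900x = peak_flops_per_cycle(10, 2, 512)  # 10 cores, two 512-bit FMA units
tr_1950x    = peak_flops_per_cycle(16, 2, 128)  # 16 cores, two 128-bit FMA units

ratio = intel_7900x / tr_1950x
println("7900X: $intel_7900x, 1950X: $tr_1950x, ratio: $ratio")
```

Per cycle that works out to 320 vs 128 flops, a factor of 2.5 in favor of the 7900X despite its lower core count.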

Using MKL:

julia> A = randn(4096,4096);

julia> C = randn(4096,4096);

julia> B = randn(4096,4096);

julia> using LinearAlgebra, BenchmarkTools

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     174.715 ms (0.00% GC)
  median time:      176.507 ms (0.00% GC)
  mean time:        176.510 ms (0.00% GC)
  maximum time:     178.155 ms (0.00% GC)
  --------------
  samples:          29
  evals/sample:     1

Over twice as fast as the Threadripper! At the expensive end, Intel and its AVX-512 advantage win hands down for matrix multiplication.
And this was with 4096 x 4096 matrices. It was over three times faster with 64 x 64.

In trying to optimize 8x8 by 8x8 matrix multiplication on the two processors, my best times were about 48 ns on Ryzen vs 11.6 ns with AVX-512.

Maybe I should build Julia with OpenBLAS on the 7900X to see how it compares. I’d expect Intel + OpenBLAS to be faster than the Threadripper at all sizes too, but it could be that at the small end Ryzen + MKL is faster than Intel + OpenBLAS.

In my original post, I should have paid more attention to how many threads each was using. I realized OpenBLAS gets faster with BLAS.set_num_threads(1) for some sizes; I think it turns on multithreading too soon.
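For anyone who wants to check the thread-count effect on their own machine, here is a minimal sketch that times mul! with one BLAS thread and with all hardware threads at a few sizes. The helper `time_mul` is hypothetical, it uses plain `@elapsed` instead of BenchmarkTools to stay dependency-free (so the numbers will be noisier), and it assumes Julia 1.6 or newer for `BLAS.get_num_threads`.

```julia
using LinearAlgebra

# Time an n x n mul! with a given number of BLAS threads and return the
# fastest of `reps` runs in seconds. The previous thread count is restored
# afterwards so the sweep does not leave the session in a changed state.
function time_mul(n, nthreads; reps = 5)
    A = randn(n, n); B = randn(n, n); C = similar(A)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(nthreads)
    try
        mul!(C, A, B)  # warm-up: compilation and BLAS thread startup
        return minimum(@elapsed(mul!(C, A, B)) for _ in 1:reps)
    finally
        BLAS.set_num_threads(old)
    end
end

for n in (64, 256, 1024)
    t1 = time_mul(n, 1)
    tN = time_mul(n, Sys.CPU_THREADS)
    println("n = $n: 1 thread ", round(t1 * 1e6, digits = 1), " μs, ",
            Sys.CPU_THREADS, " threads ", round(tN * 1e6, digits = 1), " μs")
end
```

If the single-threaded time wins at the smaller sizes, that would be consistent with multithreading kicking in too early, though results will vary with the BLAS build and hardware.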

EDIT:
Does Julia + OpenBLAS launch with fewer threads now? I thought I had quite the regression, but BLAS.set_num_threads(16) fixed it. Threadripper:

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     369.510 ms (0.00% GC)
  median time:      370.037 ms (0.00% GC)
  mean time:        389.715 ms (0.00% GC)
  maximum time:     622.606 ms (0.00% GC)
  --------------
  samples:          13
  evals/sample:     1
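
As a rough sanity check, the two minimum times can be converted to throughput. This sketch assumes the usual 2n^3 flop count for a dense n x n matrix multiply and that the Threadripper run also used 4096 x 4096 matrices; `gflops` is a helper name I picked for illustration.

```julia
# Throughput of an n x n by n x n multiply in GFLOPS, assuming 2n^3 flops.
gflops(n, seconds) = 2 * n^3 / seconds / 1e9

mkl_7900x      = gflops(4096, 174.715e-3)  # minimum time from the MKL run
openblas_1950x = gflops(4096, 369.510e-3)  # minimum time from the Threadripper run

println(round(mkl_7900x, digits = 1), " vs ",
        round(openblas_1950x, digits = 1), " GFLOPS")
```

That comes out to roughly 787 vs 372 GFLOPS, which matches the "over twice as fast" impression from the raw timings.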