OpenBLAS is faster than Intel MKL on AMD Hardware (Ryzen)

Thanks, added them.

Personally, I’ve been using Julia built with OpenBLAS because I got tired of issues such as ARPACK.jl not working. The _jlls are convenient for benchmark scripts though, because (a) they mean I don’t need to build Julia with MKL, and (b) someone else can run the script without me needing to make any assumptions or checks for what BLAS.vendor() returns.

I would consider using MKL_jll as a dependency in my own libraries though, because its performance (especially multi-threaded) is remarkable. However, as I assume it doesn’t work on ARM, that would force me to do a lot of special casing. And as Apple is moving to ARM, we’ll soon be seeing a lot more of it.
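To sketch what that special casing might look like (this is a hypothetical flag, not anything MKL_jll actually provides): `Sys.ARCH` reports the host architecture, so a package could fall back to OpenBLAS on anything that isn't x86_64.

```julia
# Hedged sketch of architecture-based special casing; `use_mkl` is a
# hypothetical flag, not part of MKL_jll's API.
# Sys.ARCH is a Symbol such as :x86_64 or :aarch64.
const use_mkl = Sys.ARCH === :x86_64

backend() = use_mkl ? "MKL_jll" : "OpenBLAS fallback"
```

Every such branch has to be maintained and tested on both architectures, which is the cost I'd rather avoid.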

If you want to estimate flops:

M = K = N = 16_000;
A = rand(M, K); B = rand(K, N); C = Matrix{Float64}(undef, M, N);
dgemmkl!(C, A, B);                  # warm-up run, so compilation isn't timed
time = @elapsed dgemmkl!(C, A, B);
2e-9M * K * N / time

This would yield GFLOPS: a dense matrix product does 2·M·K·N flops (one multiplication and one addition per term), and the 2e-9 folds in the conversion to giga.

Using your reported times with 1_000x1_000 matrices and 8 threads:

julia> M = K = N = 1000
1000

julia> 2e-9M * K * N / 7.374e-3
271.2232167073502

julia> 2e-9M * K * N / 9.142e-3
218.7705097352877

Versus the 431 GFLOPS you saw with your Julia 1.4 installed via apt. Efficiency should keep improving as the matrices grow, because the O(N³) compute increasingly amortizes the O(N²) memory traffic.
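A quick way to see that scaling, sketched with `mul!` (which dispatches to whatever BLAS your Julia is built with; substitute the MKL wrapper to measure MKL specifically):

```julia
using LinearAlgebra

# Hedged sketch: GFLOPS should climb toward the theoretical peak as N grows.
function gemm_gflops(N)
    A = rand(N, N); B = rand(N, N); C = Matrix{Float64}(undef, N, N)
    mul!(C, A, B)                     # warm-up run (compilation, page faults)
    t = @elapsed mul!(C, A, B)
    return 2e-9 * N^3 / t             # 2 flops per term of the N^3 products
end

for N in (500, 1_000, 2_000, 4_000)
    println(N, " => ", round(gemm_gflops(N), digits = 1), " GFLOPS")
end
```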

The theoretical peak of your CPU is:

julia> GHz = 4.25 # clock cycles per nanosecond
4.25

julia> ops_per_fma = 8 # AVX2: 256-bit registers hold 4 Float64, so one FMA does 4 multiplications and 4 additions
8

julia> instr_per_clock = 2 # 2 fma per clock cycle
2

julia> cores = 8
8

julia> GHz * ops_per_fma * instr_per_clock * cores
544.0

BLAS should be able to get fairly close to the theoretical peak.
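The arithmetic above can be packaged as a small helper; the SIMD width and flops-per-FMA are the AVX2 double-precision values, and the clock speed and core count are the same assumptions as above:

```julia
# Hedged sketch: theoretical double-precision peak in GFLOPS for an
# AVX2 + FMA core. Defaults: 4 Float64 lanes per 256-bit register,
# 2 flops (mul + add) per FMA, 2 FMA units issuing per cycle.
peak_gflops(ghz, cores; simd_width = 4, flops_per_fma = 2, fma_per_cycle = 2) =
    ghz * simd_width * flops_per_fma * fma_per_cycle * cores

peak_gflops(4.25, 8)  # 544.0, all cores
peak_gflops(4.25, 1)  # 68.0, one core
```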

Also, this lets us confirm that MKL does seem to be using AVX2 on your Ryzen:

julia> 2e-9M * K * N / 34.320e-3 # single-threaded GFLOPS at 1000x1000
58.275058275058285

julia> 4.25 * 8 * 2 # theoretical peak for 1 core
68.0
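To reproduce that single-threaded measurement, you can pin BLAS to one thread so the result is directly comparable to the one-core peak (sketch using `mul!`, which calls whatever BLAS is loaded):

```julia
using LinearAlgebra

# Hedged sketch: restrict BLAS to a single thread, then time one GEMM.
BLAS.set_num_threads(1)
N = 1_000
A = rand(N, N); B = rand(N, N); C = Matrix{Float64}(undef, N, N)
mul!(C, A, B)                         # warm-up run
t = @elapsed mul!(C, A, B)
single_core_gflops = 2e-9 * N^3 / t   # compare against the one-core peak
```

Getting within 80–90% of the single-core peak is a good sign that the AVX2 FMA units are actually being used.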