Thanks, added them.
Personally, I’ve been using Julia built with OpenBLAS because I got tired of issues such as ARPACK.jl not working. The `_jll`s are convenient for benchmark scripts though, because (a) they mean I don’t need to build Julia with MKL, and (b) someone else can run the script without me needing to make any assumptions or checks for what `BLAS.vendor()` returns.
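For the record, the kind of check I mean is something like this minimal sketch (assuming a pre-1.7 Julia, where `BLAS.vendor()` is the API and returns e.g. `:openblas64` or `:mkl`):

```julia
using LinearAlgebra

# guard a benchmark script against being run with the wrong BLAS backend
if BLAS.vendor() !== :mkl
    error("this benchmark assumes an MKL-backed Julia; got $(BLAS.vendor())")
end
```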
I would consider using `MKL_jll` as a dependency in my own libraries though, because its performance (especially multi-threaded) is remarkable. However, as I assume it doesn’t work on ARM, that would force me to do a lot of special casing. And as Apple is moving to ARM, we’ll soon be seeing a lot more of it.
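The special casing would look roughly like this sketch (illustrative only; `Sys.ARCH` is e.g. `:x86_64` or `:aarch64`, and `BLAS.libblas` is the pre-1.7 constant naming whatever BLAS Julia shipped with):

```julia
@static if Sys.ARCH === :x86_64
    using MKL_jll                                # provides libmkl_rt on x86_64
    const GEMM_LIB = MKL_jll.libmkl_rt
else
    using LinearAlgebra
    const GEMM_LIB = LinearAlgebra.BLAS.libblas  # fall back to the default BLAS
end
```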
If you want to estimate flops:
```julia
M = K = N = 16_000;
A = rand(M, K); B = rand(K, N); C = Matrix{Float64}(undef, M, N);
time = @elapsed dgemmkl!(C, A, B);  # time a single multiply with your dgemmkl!
2e-9M * K * N / time                # 2*M*K*N flops, scaled to GFLOPS
```

would yield GFLOPS.
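For a steadier estimate, `BenchmarkTools.@belapsed` takes the minimum over several samples. A sketch, using `mul!` from LinearAlgebra instead of the MKL wrapper so it runs anywhere (the `gflops` helper is my own naming):

```julia
using BenchmarkTools, LinearAlgebra

# GFLOPS for an MxK * KxN multiply that took t seconds
gflops(t, M, K, N) = 2e-9 * M * K * N / t

M = K = N = 4_000   # smaller than above so repeated samples stay cheap
A = rand(M, K); B = rand(K, N); C = Matrix{Float64}(undef, M, N);

t = @belapsed mul!($C, $A, $B)  # minimum time over several runs
gflops(t, M, K, N)
```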
Using your reported times with `1_000`x`1_000` matrices and 8 threads:
```julia
julia> M = K = N = 1000
1000

julia> 2e-9M * K * N / 7.374e-3
271.2232167073502

julia> 2e-9M * K * N / 9.142e-3
218.7705097352877
```
Versus the 431 GFLOPS you saw with your Julia 1.4 installed via `apt`. It should continue scaling as the matrices increase in size.
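If you want to see that scaling directly, a loop like this (a sketch; `mul!` dispatches to whatever BLAS your Julia loaded) prints GFLOPS per size:

```julia
using LinearAlgebra

for n in (1_000, 2_000, 4_000, 8_000)
    A = rand(n, n); B = rand(n, n); C = Matrix{Float64}(undef, n, n)
    mul!(C, A, B)                     # warm up once before timing
    t = @elapsed mul!(C, A, B)
    println(n, "x", n, ": ", round(2e-9 * n^3 / t, digits = 1), " GFLOPS")
end
```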
The theoretical peak of your CPU is:
```julia
julia> GHz = 4.25 # clock cycles per nanosecond
4.25

julia> ops_per_fma = 8 # AVX2 means 4 additions and 4 multiplications in double precision
8

julia> instr_per_clock = 2 # 2 fma instructions per clock cycle
2

julia> cores = 8
8

julia> GHz * ops_per_fma * instr_per_clock * cores
544.0
```
BLAS should be able to get fairly close to the theoretical peak.
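For example, folding the arithmetic above into a helper (my own naming) makes the comparison quick:

```julia
# theoretical peak GFLOPS: clock (GHz) * flops per fma * fma issued per cycle * cores
peak_gflops(ghz; cores = 1, fma_per_cycle = 2, flops_per_fma = 8) =
    ghz * flops_per_fma * fma_per_cycle * cores

peak_gflops(4.25, cores = 8)           # 544.0, as above
271.2 / peak_gflops(4.25, cores = 8)   # the 8-thread MKL result is about 50% of peak
```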
Also, this lets us confirm that MKL does seem to be using AVX2 on your Ryzen:
```julia
julia> 2e-9M * K * N / 34.320e-3 # GFLOPS for your single-threaded 1000x1000 time
58.275058275058285

julia> 4.25 * 8 * 2 # theoretical peak for 1 core
68.0
```
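If you want to check the single-core number yourself against whichever BLAS you’re running, pinning BLAS to one thread is enough (`BLAS.set_num_threads` is the standard knob):

```julia
using LinearAlgebra

BLAS.set_num_threads(1)  # force a single BLAS thread
M = K = N = 1_000
A = rand(M, K); B = rand(K, N); C = Matrix{Float64}(undef, M, N)
mul!(C, A, B)            # warm up
t = @elapsed mul!(C, A, B)
2e-9 * M * K * N / t     # compare against the 68 GFLOPS single-core peak
```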