Poor OpenBLAS performance for large matrix multiply?

julia> strip(unsafe_string(ccall((BLAS.@blasfunc(openblas_get_config), libopenblas), Ptr{UInt8}, () )))
"OpenBLAS 0.3.23  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Cooperlake MAX_THREADS=512"

It is using the Cooperlake kernels, which have AVX512, for my 9950X (Zen 5).
As an aside, does OpenBLAS have AVX512_BF16 support? Cooperlake and Zen 4/5 support it.
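
As a side note on checking the backend: on Julia 1.7+, the libblastrampoline API is a simpler way to see which BLAS is loaded and how many threads it will use (a quick sketch, assuming the default setup):

julia> using LinearAlgebra

julia> BLAS.get_config()       # lists the loaded BLAS backends

julia> BLAS.get_num_threads()  # how many threads BLAS will use; worth checking against your core count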

julia> N = 449*10*2;

julia> A = rand(N,N); B = rand(N,N); C = similar(A);

julia> 2e-9N^3 / @elapsed mul!(C,A,B)
1619.8001922112328

1.6 TFLOPS (an N×N matmul performs about 2N^3 flops, hence the 2e-9N^3 numerator for GFLOPS).
I think it should be able to do around 2.25 TFLOPS.
Without AVX512, it should be limited to around half that, roughly 1.1 TFLOPS.
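
A single @elapsed run is a bit noisy; a minimal sketch of a more careful measurement with BenchmarkTools (taking the best of several samples) would be:

julia> using BenchmarkTools

julia> 2e-9N^3 / @belapsed mul!($C, $A, $B)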

I used much smaller matrices than the OP. Plugging the OP’s matrix size and MKL time into the same formula:

julia> N = 449*10^2;

julia> 2e-9N^3 / 177.625
1019.2129373680507

This is quite poor performance from MKL.
I also see worse performance from MKL compared to OpenBLAS.
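
For anyone wanting to reproduce the comparison, MKL.jl swaps the default BLAS backend at runtime via libblastrampoline; a rough sketch, reusing the same N, A, B, C as above (this is just how I would set it up, not what the OP ran):

julia> using MKL                  # BLAS calls now forward to MKL

julia> BLAS.get_config()          # confirm MKL is the active backend

julia> 2e-9N^3 / @elapsed mul!(C, A, B)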

@Oscar_Smith I think you meant to tag me.

Octavian is doing a few things sub-optimally, so I don’t recommend the package in general, but it does happen to do better than OpenBLAS for me.
It does much better than MKL (which is maybe AVX2-only here?), and especially better than BLIS (single threaded, Nehalem only?).

FWIW, I also tried the full-sized problem with Octavian:

julia> using Octavian

julia> N = 449*10^2;

julia> A = rand(N,N); B = rand(N,N); C = similar(A);

julia> @time matmul!(C,A,B);
102.650004 seconds (14.76 M allocations: 818.721 MiB, 0.17% gc time, 3.97% compilation time)

julia> 2e-9N^3 / 102.65
1763.6405065757428

I should really have precompiled first; that should’ve gotten it under 100 s. Still, that is far from the >2 TFLOPS estimate.
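
A minimal sketch of what I mean by precompiling/warming up first, so @time only measures the multiply (same A, B, C as above):

julia> using Octavian

julia> matmul!(zeros(2, 2), rand(2, 2), rand(2, 2));  # small warm-up call to trigger compilation

julia> @time matmul!(C, A, B);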

I got the 2TFLOPS estimate from:

julia> 4.4 * 16 * 16 * 2
2252.8

That is 4.4 cycles/ns * 16 cores * 16 flops/fma * 2 fma/cycle (cycles/ns == GHz clock speed; I’m assuming heavy downclocking, and each 512-bit Float64 FMA counts as 8 multiplies + 8 adds = 16 flops).
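
For comparison, the same estimate with AVX2-width FMAs (8 Float64 flops per 256-bit FMA instead of 16) gives the roughly-half figure above:

julia> 4.4 * 16 * 8 * 2
1126.4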

I hope to have an alternative that obsoletes Octavian in a year or so, but if anyone is up for fixing Octavian itself in the meantime, I can answer questions.
