Poor openBLAS performance for large matrix multiply?

gluon-free · September 12, 2024, 10:38pm

I am new to Julia and not an experienced programmer. Playing with things and trying simple benchmarks for my new Ryzen 9950x. This one rises a question:

using MKL
using CUDA, LinearAlgebra, Pkg, BenchmarkTools
 let N = 449*10^2
           A = randn(N, N)
           B = randn(N, N)
           C = zeros(N, N)
            @btime mul!($C, $A, $B)
        end;

Mkl library provides fair performance with about 54 GB ram usage and more than half of CPU threads utilization. Output:
177.625 s (0 allocations: 0 bytes)
Total cell evaluation is 545 s.
But using openBLAS for this problem takes forever to complete, it only use one core!

What do I not understand about openBLAS? How to force it for better utilization?
Does MKL use AVX-512 here and how to verify it? And are we see AOCL.jl for AMD CPU’s in near future?

Oscar_Smith · September 12, 2024, 10:57pm

It’s possible that OpenBlas doesn’t know what to do with zen5 and is just throwing it a generic codepath. @celrods24 do you know what’s going on here?

giordano · September 12, 2024, 11:06pm

Set the environment variable OPENBLAS_VERBOSE=2 before starting julia, to see what coretype is OpenBLAS using. Alternatively (but it’s more verbose):

julia> using LinearAlgebra, OpenBLAS_jll

julia> strip(unsafe_string(ccall((BLAS.@blasfunc(openblas_get_config), libopenblas), Ptr{UInt8}, () )))
"OpenBLAS 0.3.23  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=512"

Also, check with LinearAlgebra.BLAS.get_num_threads() the number of threads when you’re running the code with OpenBLAS vs MKL, in case there’s any difference, and if so you can change the number of threads with LinearAlgebra.BLAS.set_num_threads.

Elrod · September 12, 2024, 11:09pm

julia> strip(unsafe_string(ccall((BLAS.@blasfunc(openblas_get_config), libopenblas), Ptr{UInt8}, () )))
"OpenBLAS 0.3.23  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Cooperlake MAX_THREADS=512"

It is using Cooperlake, which has AVX512, for my 9950X.
As an aside, does OpenBLAS have AVX512_BF16 support? Cooperlake and Zen4/5 support it.

N=449*10*2;

julia> A = rand(N,N); B = rand(N,N); C = similar(A);

julia> 2e-9N^3 / @elapsed mul!(C,A,B)
1619.8001922112328

1.6 TFLOPS.
I think it should be able to do around 2.25 TFLOPS.
Without AVX512, it should be limited to around half that, 1.25 TFLOPS.

I used much smaller matrices than OP.

julia> N = 499*10^2;

julia> 2e-9N^3 / 177.625
1019.2129373680507

This is quite poor performance from MKL.
I also see worse performance from MKL compared to OpenBLAS.

@Oscar_Smith I think you meant to tag me.

Octavian is doing a few things sub-optimally, so I don’t recommend the package in general, but it does happen to do better than OpenBLAS for me.
Much better than MKL (which is maybe AVX2 only?), and especially BLIS (single threaded, nehalem only?).

FWIW, I also tried the full-sized problem with Octavian:

julia> using Octavian

julia> N = 449*10^2;

julia> A = rand(N,N); B = rand(N,N); C = similar(A);

julia> @time matmul!(C,A,B);
102.650004 seconds (14.76 M allocations: 818.721 MiB, 0.17% gc time, 3.97% compilation time)

julia> 2e-9N^3 / 102.65
1763.6405065757428

I should really have precompiled first, that should’ve gotten it <100s. Still, far away from >2TFLOPS.

I got the 2TFLOPS estimate from:

julia> 4.4 * 16 * 16 * 2
2252.8

That is 4.4 cycles/ns * 16 cores * 16 ops/fma * 2fma/cycle (cycles/ns == GHz clock speed, I’m assuming heavy downclocking).

I hope to have an alternative that obsoletes Octavian in a year or so, but if anyone is up for fixing Octavian itself in the man time, I can answer questions.

gluon-free · September 12, 2024, 11:53pm

I dont know if this CPU hack to mimic Cooper Lake actually works because after it openblas still using 1 thread, but using

LinearAlgebra.BLAS.set_num_threads(32)

worked!
Lesson learned, for everyone who face this problem - set number of threads manually!!!

gluon-free · September 12, 2024, 11:55pm

I got almost 1.5 TFlops, omg, my 4090 got beaten! And this N is actually maximum matrix dimension 24 GB of VRAM can handle so i can go further and faster on 9950x, amazing!
The interesting thing, MKL doesnt allow me to set threads above 16:

giordano · September 13, 2024, 12:02am

I don’t know if pluto does something to affect the number of BLAS threads, but by default Julia should spin a number of openblas threads equal to half the number of logical threads on your machine:

github.com

JuliaLang/julia/blob/e52a46c5c192bfe16853ae0b63ac33b280fba063/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L845-L852


      
          # https://github.com/xianyi/OpenBLAS/blob/c43ec53bdd00d9423fc609d7b7ecb35e7bf41b85/README.md#setting-the-number-of-threads-using-environment-variables
          if !haskey(ENV, "OPENBLAS_NUM_THREADS") && !haskey(ENV, "GOTO_NUM_THREADS") && !haskey(ENV, "OMP_NUM_THREADS")
              @static if Sys.isapple() && Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform()) == "aarch64"
                  BLAS.set_num_threads(max(1, @ccall(jl_effective_threads()::Cint)))
              else
                  BLAS.set_num_threads(max(1, @ccall(jl_effective_threads()::Cint) ÷ 2))
              end
          end

If you’re getting only one out-of-the-box there may be something else going on.

Elrod · September 13, 2024, 12:02am

Your 4090 would be much faster than your 9950X for Float32, and especially for data types that let it use its tensor cores, such as BFloat16 (which both support) or Float16 (which only the 4090 supports).

I dont know if this CPU hack to mimic Cooper Lake actually works because after it openblas still using 1 thread, but using

For me, it automatically used 16 threads. I wasn’t using Pluto.
You may want to try 16, as 1 thread per core normally does better.
16 did better than 32 in my tests, even on Zen5, with it’s 1 decoder per thread.
This is because block sizes assume threads have access to the entire cache.

OpenBLAS treating it as cooper-lake matters for the kernels it uses (e.g., using AVX512) and block sizes (which should be a function of cache sizes; Cooperlake and Zen5 have the same sized L2 cache and both are 16-way associative, but Zen5 has 50% larger L1D, and far larger L3 per core; still, cooperlake is a good choice).

gluon-free · September 13, 2024, 12:29am

I believe Zen memory bandwidth bottleneck will never allow you to get 2 TFlops on CPU. Wait for CoWoS packaging to throw away shtty Infinity Fabric and garbage distant memory controller. Y-Cruncher dev has amazing article about Zen5 AVX-512 implementation and bottlenecks:
http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/

gluon-free · September 13, 2024, 12:45am

Problem with Pluto looks like the case, i run through terminal:

And without hyper threading it performed better.
So looks like Pluto should not be used by noobies.

giordano · September 13, 2024, 12:49am

Yes: [OpenBLAS_jll] Update to new build with BFloat16 kernels by giordano · Pull Request #53059 · JuliaLang/julia · GitHub. This is in Julia v1.11.

Elrod · September 13, 2024, 1:09am

I agree, that’s a great article.

However, I think we should be able to get 2 TFLOPS.
Good cache tiling reduces memory bandwidth requirements.
My 10980XE gets 2 TFLOPS, and its RAM bandwidth is only slightly better (4 channels of DDR4 vs 2 channels of DDR5), while its L3 has both much lower bandwidth and capacity than the 9950X’s.

Octavian.jl does better than OpenBLAS for me, even though it stupidly requires more memory bandwidth from matrix A than necessary, and its cache blocking algorithm is also sub-optimal.
I’m confident I’ll hit >2TFLOPs with my new approach.

stevengj · September 13, 2024, 1:48am

The intuition here is that m \times m matrix multiplication does \Theta(m^3) work for \Theta(m^2) data, so if you arrange the data accesses correctly then it’s usually been possible for it to be compute-bound rather than memory-bound for large m.

Elrod · September 13, 2024, 2:10am

Unfortunately, the memory bandwidth requirement is still \Theta(m^3).
For a given cache level, you can have 3 submatrices inside, and calculate C_{m_c,n_c} = C_{m_c,n_c} + A_{m_c,k_c} * B_{k_c,n_c}.

The issue is that if we iterate over slice-subscripts m_c,n_c,k_c, we swap out submatrices a total of \text{cld}(M,m_c)\text{cld}(N,n_c)\text{cld}(M,m_c) times.
We generally want to iterate over just one loop at a time, so that we can at least keep one of the three arrays in that cache level (i.e., the array not indexed by that loop).
There are some tricks we can do to improve efficiency, but we still ultimately need \Theta(m^3) bandwidth, we’re just shrinking the constant.

As an aside, you’ve often pointed out the need for non-square tiles. That should be easy to see here: every iteration of whichever loop is inner-most among m_c,n_c,k_c swaps out two out of three arrays. Therefore, we do not get any reuse for two out of three arrays! We thus often want the slice of the array we do actually keep to be the largest of the three [with at least one notable exception].
We have multiple levels of caches (of different sizes), and have a different loop that does the swap outs for each, meaning different levels keep different arrays.
Solving to do this optimally probably won’t often produce square tiles.

This is a simplified view. There are some complications, as well as some tricks that can improve the situation.

EDIT:
I personally found these two articles very helpful:

https://dl.acm.org/doi/pdf/10.1145/3445814.3446759

stevengj · September 13, 2024, 2:47pm

Right, there’s a proof by Hong and Kung (1981) that the optimal number of cache misses (for an ideal cache of size Z) is \Theta(m^3/\sqrt{Z}) (for the cubic-complexity multiplication algorithm). But the reason why you can save such a large factor \sqrt{Z} is because there is so much work per data. Indeed, you can obtain this complexity from blocking by showing that each block can do \Theta(Z^{3/2}) work for Z data. And this factor of \sqrt{Z} is usually big enough to hide the memory latency. .

Elrod · October 5, 2024, 2:46am

https://arxiv.org/pdf/1904.05717 is the best article I’ve seen so far on matrix cache tiling.
They show that square cache tiles are optimal, my oversight earlier was that you can just add extra tiling loops to achieve this.
They develop a new algorithm that does better than Goto’s, which is what OpenBLAS uses:
20241004_22h34m58s_grim
Note that going from 2 channels of DDR4-3200 to 1 channel of DDR4-800 cut GFLOPS in half, from >200 to around 120 for B3A2C0, or 100 for Goto’s (which is A2C0, by their naming convention).

The 9950X has at least 8x the FLOPS, but at best 2x the memory bandwidth if the i7-7700K.
In terms of memory bandwidth vs FLOPS, the 9950X is thus roughly comparable to the 7700K with 1-channel of DDR4 1600 (or between that and 1200). In those tests, it is starting to fall behind 2-channel DDR4 3200.

The 9950X does, however, have larger caches, so that can mitigate it this – I expect/hope well enough to hit 2 TFLOPS with fp64.
But memory bandwidth is most definitely a serious issue CPUs.

Oscar_Smith · October 5, 2024, 5:44am

That’s an awesome paper! I now really want the Octavian2 that does automatic packing selection on top of good automatically generated microkernels.

PetarM · April 4, 2025, 11:03am

Hi! Any update on this? Did you manage to get better performance in the meantime?

Based on what I read (Zen5's AVX512 Teardown + More...), the 9950X will down clock to 3.8 GHz in AVX512 heavy workload. Therefore, the maximum performance you could expect would be 1.95 TFLOPS.

I am curious how much do you get in matrix vector multiply, where you will be memory bound? I am running some benchmarks on my Mac Studio M4 Max and while the double precision TFLOPS are just a third what you get on the 10980XE, it was close to being 3x faster in matrix vector multiply, based on the numbers you shared in a different thread.

The M4 Max exonerates itself a bit in single precision, where it is able to offload the computation to its 2 AMX engines: I got 3.4 TFLOPS in Julia on dense matrix multiply (35W from the wall plug), and that’s with just 2 P-cores loaded and the other 10 idling. Theoretically, for a hybrid NEON/AMX implementation it could reach slightly over 4 TFLOPS, which I was able to achieve by multiplying 2 matrices in parallel (one with 10P cores, the other with 2 AMX engines/2P cores).

Topic		Replies	Views
MKL slower than openblas in intel cpu Performance mkl , linearalgebra	7	3423	March 2, 2022
OpenBLAS is faster than Intel MKL on AMD Hardware (Ryzen) Performance blas , lapack	40	36279	June 19, 2020
OpenBLAS vs MKL General Usage mkl	7	15705	January 16, 2020
What is the current state of multi-threaded BLAS in Julia? Performance blas	8	1255	March 23, 2024
OpenBLAS much slower on Windows than Linux Performance windows , blas , linearalgebra , openblas	4	1396	July 23, 2022

Poor openBLAS performance for large matrix multiply?

Related topics