# Asymptotically faster matrix-multiplication algorithms

Does there even exist an actual implementation?

I wouldn't have known, but Google did turn up Matrix-Multiplication. From the descriptions I've heard, I always assumed that the break-even point would be for a matrix so large that I wouldn't have a computer with that much RAM.

Your search karma seems better than mine. I gave up after finding a reference to the original paper, where the authors state (start of section 5):

> Previous authors in this field have exhibited their algorithms directly, but we will have to rely on hashing and counting arguments to show the existence of a suitable algorithm.

Does it break even with the plain method for sizes of a few hundred?

The Strassen Reloaded papers put the cutoff closer to 1000, and they were benchmarking against worse regular implementations. I would be interested in someone trying Strassen on Octavian.

Yes. It's dependent on the hardware and the details of the implementation, but Strassen's algorithm (and the Winograd variant of Strassen's algorithm that trims away some additions) can have break-even points in that range. Doing a step or two of Strassen's algorithm before using a conventional BLAS call can speed things up on large matrices. People who really want to multiply large dense matrices do that sometimes.
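To make "a step or two of Strassen before a conventional BLAS call" concrete, here is a minimal sketch of a single Strassen step in Julia. It is an illustration under simplifying assumptions (square matrices of even size, fresh allocations for every temporary, the seven half-size products delegated to the ordinary BLAS `*`); a serious implementation would recurse, handle odd dimensions by padding or peeling, and reuse workspace.

```julia
using LinearAlgebra

# One step of Strassen's algorithm on even-sized square matrices.
# The seven half-size products fall back to the conventional BLAS `*`.
function strassen_step(A::Matrix{T}, B::Matrix{T}) where {T}
    n = size(A, 1)
    @assert size(A) == size(B) == (n, n) && iseven(n)
    h = n ÷ 2
    A11 = A[1:h, 1:h];    A12 = A[1:h, h+1:n]
    A21 = A[h+1:n, 1:h];  A22 = A[h+1:n, h+1:n]
    B11 = B[1:h, 1:h];    B12 = B[1:h, h+1:n]
    B21 = B[h+1:n, 1:h];  B22 = B[h+1:n, h+1:n]

    # Strassen's seven products (instead of the eight of naive blocking)
    M1 = (A11 + A22) * (B11 + B22)
    M2 = (A21 + A22) * B11
    M3 = A11 * (B12 - B22)
    M4 = A22 * (B21 - B11)
    M5 = (A11 + A12) * B22
    M6 = (A21 - A11) * (B11 + B12)
    M7 = (A12 - A22) * (B21 + B22)

    C = similar(A)
    C[1:h, 1:h]     .= M1 .+ M4 .- M5 .+ M7
    C[1:h, h+1:n]   .= M3 .+ M5
    C[h+1:n, 1:h]   .= M2 .+ M4
    C[h+1:n, h+1:n] .= M1 .- M2 .+ M3 .+ M6
    return C
end

# Sanity check against the BLAS product:
# A, B = randn(1024, 1024), randn(1024, 1024)
# maximum(abs.(strassen_step(A, B) - A * B))
```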

But I don't think it has ever made it into a commonly used BLAS library, mostly for practical reasons:

- You need extra storage.
- It is less effective for products involving tall or wide matrices.
- It has weaker stability guarantees.

Speculating a little bit, it also doesn't seem entirely clear to me that it would be easy to get big benefits in other dense matrix computations by using a Strassen BLAS. LAPACK uses its own block sizes that are typically smaller than the sizes that would be strong candidates for Strassen. (On my machine, unless I messed something up, a ccall of LAPACK's ilaenv returns 64 for the LU factorization block size with OpenBLAS.) The calculation of optimal block sizes would presumably change if Strassen's algorithm were available. But there could be some places where that increases overall work in some secondary computation that was previously negligible at modest block sizes. (The computation of triangular factors associated with block Householder transformations using xLARFT seems like a candidate for something that becomes more costly with increasing block size.) That could make it harder to break even or do better than standard BLAS. I'm speculating and could be wrong, but using Strassen for such things doesn't look trivial to me.
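For the curious, the ccall I attempted looked roughly like the sketch below. The `@blasfunc`/`libblastrampoline` plumbing and the trailing hidden string-length arguments follow the pattern of Julia's own LAPACK wrappers, but the symbol suffixes and calling convention are build- and version-dependent assumptions, so treat this as an illustration rather than a portable API.

```julia
using LinearAlgebra
using LinearAlgebra: BlasInt, libblastrampoline
using LinearAlgebra.BLAS: @blasfunc

# Ask LAPACK's ILAENV for the optimal block size (ISPEC = 1) for a
# given routine, e.g. "DGETRF" for double-precision LU factorization.
# The trailing Clong arguments are the compiler's hidden character-length
# parameters; this convention is an assumption about the LAPACK build.
function lapack_blocksize(routine::String, n::Integer)
    ccall((@blasfunc(ilaenv_), libblastrampoline), BlasInt,
          (Ref{BlasInt}, Ptr{UInt8}, Ptr{UInt8},
           Ref{BlasInt}, Ref{BlasInt}, Ref{BlasInt}, Ref{BlasInt},
           Clong, Clong),
          1, routine, "", n, n, -1, -1, length(routine), 0)
end

# lapack_blocksize("DGETRF", 4096)  # returned 64 with OpenBLAS on my machine
```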

The report *Implementing Strassen's Algorithm with BLIS* makes the case that some of these concerns can be handled. They are getting good results for matrix multiplication of large matrices. I'd still be skeptical about broader use in dense matrix computations, however.

2 Likes

To get more than a marginal (> 10%) speedup they need sizes of a few thousand, and to get a > 20% speedup they need sizes \gtrsim 10^4. This seems pretty consistent with other experiments I've seen over the years, e.g. with ATLAS BLAS in 2006.
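If someone wants to reproduce this kind of measurement, the harness is straightforward with BenchmarkTools; everything here is just a measurement sketch, and `strassen_step` refers to the illustrative one-step code earlier in the thread, not a tuned implementation.

```julia
using BenchmarkTools, LinearAlgebra

BLAS.set_num_threads(Sys.CPU_THREADS)  # give the BLAS baseline a fair shot

n = 4096                       # around where Strassen may start to pay off
A, B = randn(n, n), randn(n, n)
C = similar(A)

@btime mul!($C, $A, $B)            # conventional BLAS gemm baseline
# @btime strassen_step($A, $B)     # one Strassen step (sketch earlier in thread)
```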

1 Like

IIRC, some of the issues with the algorithms are decreased numerical stability though, and that's something that also needs to be factored into the discussion. Faster without accuracy isn't necessarily faster. But it would be interesting to have a good Julia implementation to play around with and benchmark in real applications.

Specifically, for Strassen you lose componentwise bounds. For C = AB, ordinary matrix multiplication gives a computed \hat{C} with

|C - \hat{C}| \leq n u |A| |B| + O(u^2),

where u is the unit roundoff and the absolute values and the inequality are componentwise.

For the original Strassen algorithm, the bounds are normwise and of the form

\|C - \hat{C}\| \leq f_n u \|A\| \|B\| + O(u^2)

for some f_n that has a larger constant factor but does not grow dramatically faster than a normwise bound for conventional multiplication would, especially if the recursion doesn't go too deep before you switch to conventional multiplication. The Winograd variant with fewer additions doesn't necessarily satisfy the normwise bound unless you introduce some scaling. Unless you are specifically worried about componentwise errors, this is probably still fine for many uses. There is a very nice chapter in N. Higham's "Accuracy and Stability of Numerical Algorithms" that covers the stability of these algorithms, which is where I pulled this from.
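To see the componentwise bound in action, here is a small sketch comparing an ordinary Float64 product against a BigFloat reference; the ratio of the actual error to the first-order bound n u |A||B| should stay well below 1 for conventional multiplication. The matrix size and the randn test matrices are arbitrary choices for illustration.

```julia
using LinearAlgebra

n = 256
A, B = randn(n, n), randn(n, n)

Cbig = big.(A) * big.(B)       # high-precision reference product
Chat = A * B                   # ordinary Float64 multiplication
u = eps(Float64) / 2           # unit roundoff for Float64

err   = Float64.(abs.(big.(Chat) .- Cbig))   # actual componentwise error
bound = n * u * (abs.(A) * abs.(B))          # first-order bound n u |A||B|
@show maximum(err ./ bound)    # stays well below 1 in practice
```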

5 Likes

Despite the name, at a quick look that implementation seems to be the Winograd form of Strassen's method, not the Coppersmith-Winograd O(n^{2.3755}) algorithm from 1990, which is significantly more complex.

2 Likes

Good point. So I guess it was mislabeled and I fell for it. I did look briefly at the implementation and was surprised it didn't look more complicated. I was also pondering the point made by @GunnarFarneback that the original paper reads more like an existence proof than a full specification. So now I'm circling back to wondering whether the algorithm has even been fully described anywhere. I suppose that once you go much past Strassen's algorithm, an existence proof is about as useful as a fully optimized implementation.

I also do not know all the details (and those I know are very rusty), but from what I understand it is very hard to turn the mathematics into an actual implementation of CW or any later algorithm. The proof is indeed an existence proof: it shows, via probabilistic methods, the existence of a sequence of algorithms (for increasing sizes n) that attain the desired complexity.

1 Like

I wanted to add this here for completeness:

> "Galactic" algorithms, a term coined by Regan, are algorithms with good-looking asymptotic running times but concrete costs so high as to prevent their practical use on scales smaller than the universe.

2 Likes