Matrix multiply breakthrough, AlphaTensor (could also do for other algorithms): "AlphaTensor discovered algorithms that are more efficient than the state of the art for many matrix sizes."

Palli · October 8, 2022, 5:36pm

@Elrod I believe you and others (OpenBLAS, MKL etc) use the Strassen algorithm, which has now been practically improved on.

TL;DR, multiplying 10-20% faster is considered a breakthrough, now achieved:

we adapted AlphaTensor to specifically find algorithms that are fast on a given hardware, such as Nvidia V100 GPU, and Google TPU v2. These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware, which showcases AlphaTensor’s flexibility in optimising arbitrary objectives.

This means that OpenBLAS, MKL, and your (old and new) code is outdated, even for the 4x5 by 5x5 matrix multiply where I calculate 76/80 = 5% improvement, and 4% for 4x4. Also should some static array size code be special-cased, e.g. for those sizes?

[…] Our system, AlphaTensor, builds upon AlphaZero, an agent that has shown superhuman performance on board games, like chess, Go and shogi, and this work shows the journey of AlphaZero from playing games to tackling unsolved mathematical problems for the first time.

Matrix multiplication

Matrix multiplication is one of the simplest operations in algebra, commonly taught in high school maths classes. […]

Despite decades of research following Strassen’s breakthrough, larger versions of this problem have remained unsolved – to the extent that it’s not known how efficiently it’s possible to multiply two matrices that are as small as 3x3. […]

In our paper, we explored how modern AI techniques could advance the automatic discovery of new matrix multiplication algorithms. Building on the progress of human intuition, AlphaTensor discovered algorithms that are more efficient than the state of the art for many matrix sizes. Our AI-designed algorithms outperform human-designed ones, which is a major step forward in the field of algorithmic discovery.

The process and progress of automating algorithmic discovery

First, we converted the problem of finding efficient algorithms for matrix multiplication into a single-player game. In this game, the board is a three-dimensional tensor (array of numbers), capturing how far from correct the current algorithm is.
[…]
For example, if the traditional algorithm taught in school multiplies a 4x5 by 5x5 matrix using 100 multiplications, and this number was reduced to 80 with human ingenuity, AlphaTensor has found algorithms that do the same operation using just 76 multiplications.

Beyond this example, AlphaTensor’s algorithm improves on Strassen’s two-level algorithm in a finite field for the first time since its discovery 50 years ago. These algorithms for multiplying small matrices can be used as primitives to multiply much larger matrices of arbitrary size.
[…]
Leveraging this diversity, we adapted AlphaTensor to specifically find algorithms that are fast on a given hardware, such as Nvidia V100 GPU, and Google TPU v2. These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware, which showcases AlphaTensor’s flexibility in optimising arbitrary objectives.

Exploring the impact on future research and applications

From a mathematical standpoint, our results can guide further research in complexity theory, which aims to determine the fastest algorithms for solving computational problems. By exploring the space of possible algorithms in a more effective way than previous approaches, AlphaTensor helps advance our understanding of the richness of matrix multiplication algorithms. Understanding this space may unlock new results for helping determine the asymptotic complexity of matrix multiplication, one of the most fundamental open problems in computer science.

Because matrix multiplication is a core component in many computational tasks, spanning computer graphics, digital communications, neural network training, and scientific computing, AlphaTensor-discovered algorithms could make computations in these fields significantly more efficient. AlphaTensor’s flexibility to consider any kind of objective could also spur new applications for designing algorithms that optimise metrics such as energy usage and numerical stability, helping prevent small rounding errors from snowballing as an algorithm works.

While we focused here on the particular problem of matrix multiplication, we hope that our paper will inspire others in using AI to guide algorithmic discovery for other fundamental computational tasks. Our research also shows that AlphaZero is a powerful algorithm that can be extended well beyond the domain of traditional games to help solve open problems in mathematics. Building upon our research, we hope to spur on a greater body of work – applying AI to help society solve some of the most important challenges in mathematics and across the sciences.

AlphaTensor is not mentioned at:

Since the bound on omega 2.8074 of Strassen has already been improved, even to 2.3728596 in 2020, but none of the newer algorithms have been practical, because of a large constant factor.

Freivalds’ algorithm, a simple Monte Carlo algorithm that, given matrices A, B and C, verifies in Θ(n²) time if AB = C.

That’s only to verify (is it commonly used?), and omega of 2 is the lower-bound to multiply.

https://www.nature.com/articles/s41586-022-05172-4

Here we report a deep reinforcement learning approach based on AlphaZero for discovering efficient and provably correct algorithms for the multiplication of arbitrary matrices. Our agent, AlphaTensor, is trained to play a single-player game where the objective is finding tensor decompositions within a finite factor space. AlphaTensor discovered algorithms that outperform the state-of-the-art complexity for many matrix sizes. Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago
[…]
see Fig. 1a for how to represent the 2 × 2 matrix multiplication operation as a 3D tensor of size 4 × 4 × 4
[…]
AlphaTensor finds an algorithm for multiplying 4 × 4 matrices using 47 multiplications in Z2, thereby outperforming Strassen’s two-level algorithm, which involves 7² = 49 multiplications. By applying this algorithm recursively, one obtains a practical matrix multiplication algorithm in Z2 with complexity O(N^2.778)

There are 4 independent directories:

algorithms contains algorithms discovered by AlphaTensor, represented as factorizations of matrix multiplication tensors, and a Colab showing how to load these.

benchmarking contains a script that can be used to measure the actual speed of matrix multiplication algorithms on an NVIDIA V100 GPU.

nonequivalence contains 14,236 nonequivalent algorithms discovered by AlphaTensor for the same matrix multiplication problem (multiplying 4x4 matrices), and a Colab that verifies their nonequivalence.

recombination contains the code we used to decompose larger matrix multiplication tensors by recombining factorizations of smaller ones.

stevengj · October 8, 2022, 9:15pm

It’s been known for years that you could get marginal speedups (e.g. 10–20%) for matrix multiplications by applying 1 step of Strassen for matrix sizes on the order of 10000x10000, and this paper claims you can do a bit better (e.g. 11% speedup instead of 7% speedup for 10000x10000) with their new variant. This is a nice result! (Though it doesn’t improve on the asymptotic complexity of matrix multiplication.)

However, none of the popular BLAS libraries use Strassen, nor have they ever done so AFAIK. The reason, I think, is that even 20% speedups are not worth the price of allocating the auxiliary matrices required by Strassen-like algorithms (especially since the BLAS interface doesn’t provide any way for the user to pass in auxiliary buffers to re-use). (And much bigger speedups only start to appear for much larger matrices where people no longer typically use dense linear algebra.)

I suspect that, for the same reasons, the new AlphaTensor algorithms won’t be widely used in practice.

No. This is not actually faster for small matrices of scalars. The speedups are for “small” matrices over fields/rings where multiplication is vastly more expensive than addition — e.g. for for 4x4 matrices where the “elements” of each matrix are themselves large matrices, i.e. block matrices. That’s why in practice these algorithms become competitive only for matrices of a few thousand rows/columns (of scalars).

See also this review of the AlphaTensor result.

jling · October 8, 2022, 11:57pm

from what I can gather there’s some big asterisk,

I’m quite suspicious about their hardware benchmarks. They’re not writing custom kernels, they’re relying on a graph compiler like XLA to automatically fuse their decomposed matmuls (and my guess is that XLA will not be very good at this).

they “didn’t feel like” need to write a CUDA kernel, so it’s pure XLA
we know XLA can be worse than optimal in some cases

→ it’s possible they just found a way to organize Jax operation so XLA can compile better code…

photor · October 9, 2022, 6:10am

Noticed that news. Looks great.

stevengj · October 10, 2022, 11:59am

That’s the conventional thing to do, I think. For these type of algorithms, a key practical question is whether a single divide-and-conquer step is faster than highly optimized library X. That is, you often want to compare matrix multiplication with X to matrix multiplication where you break into K submatrices (e.g. K=2 or K=4) and call X on the submatrices.

On the other hand, this 2017 paper argues that you can do better by customizing the lower-level kernels to merge them with some of the Strassen-like matrix additions.

stevengj · October 10, 2022, 12:06pm

I should note that there was a 2016 paper that argued otherwise, claiming that optimized matrix-multiplication libraries already use such buffers internally—at least in the context of libflame and BLIS—and that the same buffers can be used for Strassen.

So, it may be that there are some practical BLIS-based linear-algebra libraries that currently use Strassen for sufficiently large sizes, e.g. ffpack and tblis, and these may be fruitful to update to use the new AlphaTensor-discovered algorithms.

photor · May 30, 2025, 9:52am

Now, on an AMD EPYC 7K83 platform, I can achieve 10% speedup for 480x480 square matmul with 1-step Strassen (together with a simple matmul kernel written in LV.jl), 15% for 960x960 with 2-step Strassen, 20% for 1920x1920 with 3-step Strassen.
Oh, forgot to mention my results are single-threaded. I don’t have the expertise to optimize the multi-threaded code for Strassen, or Strassen is not so good for the multi-threaded case.

Topic		Replies	Views
New 4x4 algorithm found Offtopic	28	3641	May 23, 2025
Multiply many-matrices by many-vectors Performance matlab , parallel , multithreading , tensors	33	6852	December 14, 2018
How much faster is GPU compare to CPU GPU	16	27009	November 24, 2018
Algorithmic complexity of matrix multiplication in Julia Performance question	10	2025	March 20, 2021
Why mul! is so fast? General Usage question , linearalgebra	7	6952	November 26, 2019

Matrix multiply breakthrough, AlphaTensor (could also do for other algorithms): "AlphaTensor discovered algorithms that are more efficient than the state of the art for many matrix sizes."

Matrix multiplication

The process and progress of automating algorithmic discovery

Exploring the impact on future research and applications

Related topics