New 4x4 matrix multiplication algorithm found

Someone found a new algorithm for 4x4 matrix multiplication.

This will save the industry billions.

2 Likes

Probably an overstatement. See the discussion of the previous AlphaTensor results: Matrix multiply breakthrough, AlphaTensor (could also do for other algorithms): "AlphaTensor discovered algorithms that are more efficient than the state of the art for many matrix sizes."

4 Likes

Also… are they numerically stable? You need to care not only about performance but also about the error growth of the new algorithms, and even Strassen’s algorithm requires modifications to keep that under control:

AlphaEvolve didn’t have any measure of numerical stability, and lots of these “faster matrix multiplication” algorithms have difficulty in this area. So I would presume AlphaEvolve’s algorithms will have poor error-growth behavior without extra changes and corrections, which of course slow them down.
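A quick way to see this effect yourself (my own sketch, not anything from the paper): compare one level of Strassen against the plain Float32 product on random matrices, using a Float64 product as the reference.

using LinearAlgebra

# One level of Strassen's 2x2 block algorithm (sketch only: even sizes, no recursion or padding).
function strassen_one_level(A, B)
    n = size(A, 1) ÷ 2
    A11, A12, A21, A22 = A[1:n, 1:n], A[1:n, n+1:end], A[n+1:end, 1:n], A[n+1:end, n+1:end]
    B11, B12, B21, B22 = B[1:n, 1:n], B[1:n, n+1:end], B[n+1:end, 1:n], B[n+1:end, n+1:end]
    M1 = (A11 + A22) * (B11 + B22)
    M2 = (A21 + A22) * B11
    M3 = A11 * (B12 - B22)
    M4 = A22 * (B21 - B11)
    M5 = (A11 + A12) * B22
    M6 = (A21 - A11) * (B11 + B12)
    M7 = (A12 - A22) * (B21 + B22)
    [(M1 + M4 - M5 + M7) (M3 + M5); (M2 + M4) (M1 - M2 + M3 + M6)]
end

n = 512
A, B = randn(Float32, n, n), randn(Float32, n, n)
C_ref = Float64.(A) * Float64.(B)               # higher-precision reference
relerr(C) = norm(Float64.(C) - C_ref) / norm(C_ref)
@show relerr(A * B)                             # plain Float32 (BLAS) product
@show relerr(strassen_one_level(A, B))          # typically a somewhat larger error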

11 Likes

Related:

4 Likes

Where can I find AlphaEvolve’s original paper on this new algorithm?

1 Like

But wouldn’t it at least work for something like large language models? In large language models, precision doesn’t really matter that much; people have even been able to quantize models to 4 bits with minimal performance loss.
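As a toy illustration of that point (a sketch with a single per-matrix scale; real quantization schemes use per-channel or per-group scales), you can quantize a random weight matrix to 4 bits and look at the error of the resulting matvec:

using LinearAlgebra

# Toy symmetric 4-bit quantization with one scale for the whole matrix.
function quantize4(W)
    scale = maximum(abs, W) / 7                 # symmetric 4-bit range: -7..7
    q = round.(Int8, clamp.(W ./ scale, -7, 7))
    return q, scale
end
dequantize(q, scale) = Float32.(q) .* scale

W = randn(Float32, 256, 256)                    # stand-in for a weight matrix
x = randn(Float32, 256)                         # stand-in for activations
q, s = quantize4(W)
y_full = W * x
y_quant = dequantize(q, s) * x
@show norm(y_quant - y_full) / norm(y_full)     # relative error of the quantized matvec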

1 Like

I’m very surprised by the use of a general-purpose LLM, Gemini, for dealing with a narrowly defined mathematical problem like matrix multiplication. Gemini’s training texts presumably include, e.g., Shakespeare’s works and New York Times editorials, both of which could have subtly affected the discovery of a matrix multiplication algorithm, since transformer neural networks can potentially couple everything.

When all you have is a hammer that you spent hundreds of millions of dollars on, everything looks like a golden nail that might have some ROI.

17 Likes

If anybody is interested in trying to do “algorithm discovery” in Julia, here’s one way you can do it via evolution. No LLMs, just regular genetic algorithms.

using SymbolicRegression
using MLJBase

Let’s try to discover 2x2 matrix algorithms, for some constraints on the operations.

First, we make some random 2x2 matrices and compute their products. These are the labels we will regress against.

N = 1_000
X = randn(Float32, N, 2 * (2 * 2))

# Label = 4-tuple representing matrix product
y = [
    let
        A = reshape(x[1:4], 2, 2)
        B = reshape(x[5:8], 2, 2)
        C = A * B
        tuple(C[:]...)
    end
    for x in eachrow(X)
]

For simplicity, let’s say that we want to try to rediscover the Strassen formula for 2x2 multiplication (just scalars). We can use a “template expression” to try to do this, which lets us evolve multiple expressions simultaneously. Here, we will evolve 14 expressions: 10 intermediates (some redundancy for easier evolution), and 4 outputs, one per element.

expression_spec = @template_spec(expressions=(
    m1, m2, m3, m4, m5, m6, m7, m8, m9, m10,  # intermediate expressions
    c11, c12, c21, c22                        # output expressions
)) do a11, a12, a21, a22, b11, b12, b21, b22  # input features

    features = (a11, a12, a21, a22, b11, b12, b21, b22)
    # First, we compute the intermediates
    M1 = m1(features...)
    M2 = m2(features...)
    M3 = m3(features...)
    M4 = m4(features...)
    M5 = m5(features...)
    M6 = m6(features...)
    M7 = m7(features...)
    M8 = m8(features...)
    M9 = m9(features...)
    M10 = m10(features...)
    intermediates = (M1, M2, M3, M4, M5, M6, M7, M8, M9, M10)
    # Then, compute the outputs
    C11 = c11(intermediates...)
    C12 = c12(intermediates...)
    C21 = c21(intermediates...)
    C22 = c22(intermediates...)

    # Output needs to be a `ValidVector` type; these automate vectorization and propagate validity checks
    return ValidVector(
        map(i -> (C11.x[i], C12.x[i], C21.x[i], C22.x[i]), eachindex(C11.x)),
        C11.valid && C12.valid && C21.valid && C22.valid
    )
end

We also need to define a custom loss function to support our 4-tuple outputs:

tuple_loss(prediction, target) = sum(abs2, map(-, prediction, target))

Finally, we throw it into a regressor object. So that we can try to discover Strassen, we can make the * operation more “complex” than + and -:

model = SRRegressor(
    binary_operators=(+, -, *),
    unary_operators=(),
    complexity_of_operators=[(+) => 1, (-) => 1, (*) => 2], # Make * more expensive
    complexity_of_constants=10,       # Prevents constants from being used
    should_optimize_constants=false,  # Also, no need for BFGS constant tuning
    expression_spec=expression_spec,
    elementwise_loss=tuple_loss,
    loss_type=Float32,
    batching=true,
    batch_size=32,
    maxsize=200,  # Total complexity over all expressions
    niterations=100000,  # How long the search lasts
)

mach = machine(model, X, y; scitype_check_level=0)
fit!(mach)

With this, the search was able to ‘rediscover’ regular matrix multiplication in about 10 minutes on my laptop. With enough compute to burn, it might get to Strassen, though I think that will need a lot of compute. It would also help to tune population_size (and maybe populations and ncycles_per_iteration) to larger values to cover the large search space needed.
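For reference, those knobs are ordinary SRRegressor keyword arguments; an untested guess at a larger-scale configuration (same expression_spec and tuple_loss as above, with the larger values being guesses rather than tested settings) could look like:

model_big = SRRegressor(
    binary_operators=(+, -, *),
    complexity_of_operators=[(+) => 1, (-) => 1, (*) => 2],
    complexity_of_constants=10,
    should_optimize_constants=false,
    expression_spec=expression_spec,
    elementwise_loss=tuple_loss,
    loss_type=Float32,
    batching=true,
    batch_size=32,
    maxsize=200,
    niterations=1_000_000,
    population_size=100,           # larger populations
    populations=40,                # more populations evolving in parallel
    ncycles_per_iteration=1_000,   # more evolution between migrations
)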

47 Likes

How many “test programs” were generated and executed in the 10 minutes?

The search above looks to be about ~2e7 expressions, the vast majority of which are only evaluated on a 32-example mini-batch, per the batch_size argument.

For comparison with brute force: the target expression is, if I remember correctly, about 50 in “complexity”. So if you use Catalan numbers and make a bunch of simplifying assumptions, you find that something like 1e50 expressions (give or take 10 orders of magnitude) are possible when there are 8 leaf types (a11 through b22) and 3 node types (+, -, *). Traditional evolutionary algorithms do a fairly good job of reducing the search space despite not having any “intelligence.”
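A back-of-envelope version of that count, under the (crude) assumption that the target is a single binary tree of complexity ~49, i.e. 24 operator nodes and 25 leaves:

# Number of tree shapes with n internal nodes is the n-th Catalan number;
# each internal node is one of 3 operators and each of the n + 1 leaves is
# one of 8 features.
catalan(n) = binomial(big(2n), n) ÷ (n + 1)
n = 24
n_exprs = catalan(n) * big(3)^n * big(8)^(n + 1)
@show Float64(n_exprs)    # ≈ 1e46, i.e. within the quoted "1e50, give or take 10 orders"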

8 Likes

Despite being general-purpose, AlphaEvolve goes beyond [25], improving the SOTA for 14 matrix multiplication algorithms; notably, for 4 × 4 matrices, AlphaEvolve improves Strassen (1969)’s algorithm by discovering an algorithm using 48 multiplications to multiply 4 × 4 complex-valued matrices.

I.e. a reduction of one multiplication, 1 − 48/49 ≈ 2.04%.

But for 3×4 times 4×6 they have a larger 4.5% improvement (three multiplications skipped), and for 4×5 times 5×6 there’s a 3.2% improvement, again 3 fewer multiplications (all the other new algorithms save only 1 or 2); see Table 2 for the other improvements.

Table 2 … For ⟨3, 4, 7⟩, ⟨4, 4, 4⟩, and ⟨4, 4, 8⟩, the algorithms discovered by AlphaEvolve use complex-valued multiplications which can be used for exact multiplication of complex or real-valued matrices.

Should we change the title to e.g. New matrix algorithms found including a 4x4 (complex-valued) algorithm?

If you know your matrices are real-valued, is there already something better still, even though you can use the new 48-multiplication algorithm or the older 49? See their footnote 3:

There exist algorithms using fewer than 49 multiplications, but they do not correspond to decompositions of the matrix multiplication tensor, and they cannot be applied recursively to multiplying larger matrices.

I.e. their new 48 isn’t always the lowest for 4 × 4 matrix multiplication, only for real- or complex-valued matrices. See their older AlphaTensor work on tensors from 2022 (Fig. 3, published in regular, open-access Nature):

https://www.nature.com/articles/s41586-022-05172-4

multiplying 4 × 4 matrices using 47 multiplications in Z2, thereby outperforming Strassen’s two-level algorithm, which involves 7^2 = 49 multiplications.

They also have an update on AlphaTensor today (intriguingly in Nature Machine Intelligence volume 7, not sure why) for Quantum circuit optimization:
https://www.nature.com/articles/s42256-025-01001-1

If I remember correctly, the 47-multiplication algorithm works over the binary field Z2, not floating-point numbers, so it’s much less important than the one found this time.

1 Like

The LLM is used for code generation, and doesn’t need to know much about matrix multiplication. IMO the key is coupling the LLM to a feedback loop. They apply automated evaluation metrics (rewards), which are usually tricky to specify. This is applied to four problem domains suitable for automated iteration, one of which is matrix multiply. They also keep a database of candidate programs and have a prompt-sampling approach that can feed these into the LLM. They are optimizing these prompts too, so it’s an extra optimization loop on top of the program optimization.
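A schematic of that outer loop (my paraphrase of the paper’s description, not the actual AlphaEvolve code; propose and evaluate here are toy placeholders for the LLM call and the automated metrics):

# Toy placeholders: in the real system `propose` would prompt an LLM with a few
# sampled candidates, and `evaluate` would run the generated program against
# automated metrics (correctness, speed, ...).
propose(parent) = parent + 0.1 * randn()
evaluate(candidate) = -abs(candidate - 42.0)

function evolve(; generations=10_000, db_size=100)
    database = [(candidate=0.0, score=evaluate(0.0))]
    for _ in 1:generations
        parent = rand(database).candidate           # prompt sampling from the program database
        child = propose(parent)                     # "LLM" proposes a modification
        push!(database, (candidate=child, score=evaluate(child)))
        sort!(database; by=x -> x.score, rev=true)  # keep only the best candidates
        database = database[1:min(db_size, end)]
    end
    return first(database)
end

@show evolve()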

As for the comment elsewhere above about numerical stability, my take is that it could be built into the evaluation, so you could optimize a weighted combo of speed and stability. It is also true that their particular application is AI, so they care more about speed than accuracy.

I wonder if it would make sense to optimize on energy usage. Perhaps there are common bit patterns in AI, where the matmul could reduce bit flipping, or favor pure adds over multiply-adds or such. I presume they want both speed and energy economy.

1 Like

The algorithm discovery papers typically cast the problem of optimizing matrix multiplication as that of finding the lowest-rank decomposition of a particular tensor. It works as follows: You can express matrix multiplication as contraction of the matrices with a tensor L:

C_{ij} = L_{ijklmn} A_{kl} B_{mn}

(Einstein summation implied), where L_{ijklmn} = \delta_{ik} \delta_{lm} \delta_{jn} . Now your challenge is to find a decomposition as follows:

L_{ijklmn} = \sum_{p = 1}^T \gamma_{ijp} \alpha_{klp} \beta_{mnp} \, ,

where T is called the rank of the decomposition. Given such a decomposition, you can reorganize the matrix product such that p is the last index you contract over:

C_{ij} = \sum_{p = 1}^T \gamma_{ijp} \left[\left(\sum_{kl} \alpha_{klp} A_{kl} \right)\left(\sum_{mn} \beta_{mnp} B_{mn}\right)\right] \, .

Clearly, the total number of multiplications is proportional to T. Further, if every element in the \gamma, \alpha, \beta tensors is \in \{-1, 0, +1\}, the tensor contractions simplify to just indexing, addition, and subtraction, so the only multiplication per p is the one between (\alpha_{klp} A_{kl}) and (\beta_{mnp} B_{mn}). Thus, the total number of multiplications is T.

Moreover, A and B are typically large block matrices or some other “heavy” type such that multiplying elements of A and B is much more expensive than any other operation. Under this assumption, only the multiplications between (\alpha_{klp} A_{kl}) and (\beta_{mnp} B_{mn}) count in terms of complexity. In the research literature these are called active multiplications, as opposed to all other operations, including any nontrivial scalar multiplications in the tensor contractions, which are inactive. Since the number of active multiplications is always equal to the rank T, this remains the minimization objective even when you allow for rational/real/complex tensor elements. For example, in the much-discussed T = 48-decomposition for 4x4 matrices, the tensor elements are \in \{0, \pm 1/2, \pm i/2, \pm (1 \pm i) / 2\}, so the algorithm contains lots of inactive scalar multiplications in addition to the 48 active multiplications.
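To make the active/inactive bookkeeping concrete, here’s a small Julia sketch (using Strassen’s well-known rank-7 decomposition for 2x2, in the index convention above) that multiplies two 2x2 block matrices with exactly T = 7 active block multiplications; everything else is scalar-coefficient work on blocks.

using LinearAlgebra

# Strassen's rank-7 decomposition: alpha[p], beta[p], gamma[p] hold the
# coefficients alpha_{klp}, beta_{mnp}, gamma_{ijp} for p = 1..7.
alpha = ([1 0; 0 1], [0 0; 1 1], [1 0; 0 0], [0 0; 0 1], [1 1; 0 0], [-1 0; 1 0], [0 1; 0 -1])
beta  = ([1 0; 0 1], [1 0; 0 0], [0 1; 0 -1], [-1 0; 1 0], [0 0; 0 1], [1 1; 0 0], [0 0; 1 1])
gamma = ([1 0; 0 1], [0 0; 1 -1], [0 1; 0 1], [1 0; 1 0], [-1 1; 0 0], [0 0; 0 1], [1 0; 0 0])

# Multiply 2x2 block matrices (given as 2x2 arrays of blocks) using T = 7
# active block multiplications; all other operations are block additions/subtractions.
function decomposition_matmul(A, B)
    C = [zeros(size(A[1, 1], 1), size(B[1, 1], 2)) for _ in 1:2, _ in 1:2]
    for p in eachindex(alpha)
        Ap = sum(alpha[p][k, l] * A[k, l] for k in 1:2, l in 1:2)  # inactive
        Bp = sum(beta[p][m, n] * B[m, n] for m in 1:2, n in 1:2)   # inactive
        Mp = Ap * Bp                                               # the p-th active multiplication
        for i in 1:2, j in 1:2
            C[i, j] += gamma[p][i, j] * Mp                         # inactive
        end
    end
    return C
end

A = [randn(3, 3) for _ in 1:2, _ in 1:2]   # a 2x2 matrix of 3x3 blocks
B = [randn(3, 3) for _ in 1:2, _ in 1:2]
C = decomposition_matmul(A, B)
C_ref = [sum(A[i, k] * B[k, j] for k in 1:2) for i in 1:2, j in 1:2]
@show maximum(norm(C[i, j] - C_ref[i, j]) for i in 1:2, j in 1:2)  # ≈ 0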

(See @stevengj’s comment in the previous thread for more context around the application domain of these algorithms: https://discourse.julialang.org/t/matrix-multiply-breakthrough-alphatensor-could-also-do-for-other-algorithms-alphatensor-discovered-algorithms-that-are-more-efficient-than-the-state-of-the-art-for-many-matrix-sizes/88455/2.)

Finally, in this formulation, it’s straightforward to check the correctness of the algorithm: you just verify that

\sum_{p = 1}^T \gamma_{ijp} \alpha_{klp} \beta_{mnp} = \delta_{ik} \delta_{lm} \delta_{jn} \, ,

for each of the 64 combinations of i,j,k,l,m,n \in \{1, 2\}. If you restrict your search to rational tensor elements and perform this verification in exact rational arithmetic, you can be certain that the algorithm you’ve found is exactly correct (in infinite-precision arithmetic, that is—whether it’s stable when taking into account floating-point roundoff is another question).
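For example, here is that check done in exact integer arithmetic for Strassen’s 2x2 coefficients (same convention as in the sketch above; the coefficients are repeated so the snippet is self-contained):

alpha = ([1 0; 0 1], [0 0; 1 1], [1 0; 0 0], [0 0; 0 1], [1 1; 0 0], [-1 0; 1 0], [0 1; 0 -1])
beta  = ([1 0; 0 1], [1 0; 0 0], [0 1; 0 -1], [-1 0; 1 0], [0 0; 0 1], [1 1; 0 0], [0 0; 1 1])
gamma = ([1 0; 0 1], [0 0; 1 -1], [0 1; 0 1], [1 0; 1 0], [-1 1; 0 0], [0 0; 0 1], [1 0; 0 0])

delta(a, b) = a == b ? 1 : 0
# Check sum_p gamma_{ijp} alpha_{klp} beta_{mnp} == delta_ik delta_lm delta_jn
# for all 64 combinations of the six indices, in exact Int arithmetic.
is_exact = all(
    sum(gamma[p][i, j] * alpha[p][k, l] * beta[p][m, n] for p in 1:7) ==
        delta(i, k) * delta(l, m) * delta(j, n)
    for i in 1:2, j in 1:2, k in 1:2, l in 1:2, m in 1:2, n in 1:2
)
@show is_exact   # true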

@MilesCranmer can SymbolicRegression.jl be applied to a problem like this? The tensor decomposition expression has a fixed form, so we’re not allowed to arbitrarily modify the expression tree; the only degree of freedom is to increase or decrease the number of terms T, and the objective is for it to be as small as possible. As for the coefficients, when rediscovering Strassen’s algorithm, we can restrict them to \{-1, 0, 1\}, in which case whether or not we’re on target is a binary property. However, if optimizing over continuous coefficients is preferred, we can relax the coefficients to, say [-1, 1], and try to minimize the distance from the target in a least squares sense. For 2x2 matrices, this problem has 64 equations in 12T unknowns.

10 Likes

Btw. this means that the following claim from the abstract, now repeated all over the internet, is at best misleading:

AlphaEvolve developed a search algorithm that found a procedure to multiply two 4 × 4 complex-valued matrices using 48 scalar multiplications

It seems like the AlphaEvolve paper uses “scalar multiplications” the same way other papers use “active multiplications”, making it sound like the algorithm can multiply two 4x4 matrices of plain complex numbers (i.e., not block matrices) using only 48 complex multiplications.

However, here’s what an implementation looks like: AlphaEvolve-MatrixMul-Verification/matrix_multiplication_algorithms.py at da09d7c476353fe888331922f16d3449abb2b21d · PhialsBasement/AlphaEvolve-MatrixMul-Verification · GitHub. You can see the 48 active multiplications starting from line 525 under the comment # Perform the 48 multiplications efficiently. But the tensor contractions, implemented in the 100 lines before and 140 lines after, contain countless multiplications, which are just as expensive as the active multiplications if your input matrix has scalar elements.

That said, the implementation could be optimized better than this. You could double each tensor component, which eliminates the inactive multiplications from the tensor contractions but scales the result by 2^3 = 8, giving you 8C. Finally, multiply each element by 1/8 to get C; that’s 16 inactive multiplications. In total that’s 64 complex multiplications, exactly the same as in the naive algorithm (but with a lot more additions and subtractions!).

I should add that the claims remain true and valid for block matrices where the distinction between active and inactive multiplications is meaningful. My gripe is only with the choice of words, which makes it sound like the algorithm can do things it cannot.

6 Likes

Potentially yes; check out Custom Types · SymbolicRegression.jl. This demonstrates how to evolve expressions on ::String.

So you could create a new type that implements the same interface for tensors. Then pass operators which can handle that type.


Although, for searching on non-scalar but numeric types, where you still want to use BFGS for constant optimization during the evolution, you likely want to set:

can_optimize(::Type{<:MyType}, _) = true

You also need to declare things like count_scalar_constants, pack_scalar_constants, and unpack_scalar_constants, so that the library knows how to get a single vector of all your constants.

To implement this, you have to conform to DynamicExpressions.ValueInterface (which is declared and tested using Interfaces.jl). You can see this file for an example: DynamicExpressions.jl/test/test_non_number_eval_tree_array.jl at 605e1116041a0d36a51b0ee90617a4ca9bea39fd · SymbolicML/DynamicExpressions.jl · GitHub.

3 Likes

Maybe we could start a new thread if we’re interested in trying to discover a new matrix multiplication algorithm with Julia. I doubt it would go anywhere beyond the current SOTA, but who knows?

1 Like

Maybe more appropriate names are “cheap multiplications” (N^2 operations each) and “expensive multiplications” (N^3 operations each with the naive method).
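That terminology lines up with a quick cost count for N x N blocks under the naive cost model (a back-of-envelope sketch, ignoring constants and memory traffic):

N = 1024
cheap_cost     = N^2      # e.g. scaling or adding N x N blocks
expensive_cost = N^3      # one block-times-block product, done naively
@show expensive_cost ÷ cheap_cost   # = N, so the expensive multiplications dominate for large blocks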