Does there even exist an actual implementation?

Does there even exist an actual implementation?

I wouldnâ€™t have known, but google did turn up Matrix-Multiplication. From the descriptions Iâ€™ve heard, I always assumed that the break-even point would be for some matrix that is so large that I wouldnâ€™t have a computer with that much RAM.

Your search karma seems better than mine. I gave up after finding a reference to the original paper where it was noted that the authors state (start of section 5):

Previous authors in this field have exhibited their algorithms directly, but we will have to rely on hashing and counting arguments to show the existence of a suitable algorithm.

Coppersmith-Winograd is certainly a bad idea to implement, but what about Strassen? Would it beat

the plain method for sizes a few hundred?

the strassen reloaded papers put the cutoff closer to 1000 and they were benchmarking vs worse regular implementions. I would be interested in someone trying strassen on Octavian

Yes. Itâ€™s dependent on the hardware and the details of the implementation, but Strassenâ€™s algorithm (and the Winograd variant of Strassenâ€™s algorithm that trims away some additions) can have break-even points in that range. Doing a step or two of Strassenâ€™s algorithm before using a conventional BLAS call can speed things up on large matrices. People who really want to multiply large dense matrices do that sometimes.

But I donâ€™t think it has ever made it into a commonly used BLAS library, mostly for practical reasons: You need extra storage. It is less effective for products involving tall and wide matrices. It has weaker stability guarantees.

Speculating a little bit, it also doesnâ€™t seem entirely clear to me that it would be easy to get big benefits in other dense matrix computations by using a Strassen BLAS. LAPACK uses its own block sizes that are typically less than the sizes that would be strong candidates for Strassen. (On my machine, unless I messed something up, an attempted `ccall`

of LAPACKâ€™s `ilaenv`

returns 64 for the block size LU factorization with OpenBLAS). The calculation of optimal block sizes would presumably change if Strassenâ€™s algorithm were available. But there could be some places where that increases overall work in some secondary computation that was previously negligible for modest block sizes. (computation of triangular factors associated with block Householder transformations using `xLARTF`

seems like a candidate for something that becomes more costly with increasing block size.) That could make it harder to break even or do better than standard BLAS. Iâ€™m speculating and could be wrong, but using Strassen for such things doesnâ€™t look trivial to me.

The report Implementing Strassenâ€™s Algorithm with BLIS makes the case that some of these concerns can be handled. They are getting good results for matrix multiplication of large matrices. Iâ€™d still be skeptical about broader use in dense matrix computations, however.

To have more than marginal (> 10%) speedup they need a few thousand, and to have > 20% speedup they need sizes \gtrsim 10^4. This seems pretty consistent with other experiments Iâ€™ve seen over the years, e.g. with ATLAS BLAS in 2006.

See also this discussion: Matrix multiply breakthrough, AlphaTensor (could also do for other algorithms): "AlphaTensor discovered algorithms that are more efficient than the state of the art for many matrix sizes." - #7 by stevengj

IIRC, some of the issues with the algorithms are decreased numerical stability though, and thatâ€™s something that also needs to be factored into the discussion. Faster without accuracy isnâ€™t necessarily faster. But it would be interesting to have a good Julia implementation to play around with and benchmark in real applications.

Specifically, for Strassen you lose componentwise bounds. For C=AB, ordinary matrix multiplication gives computed \hat{C} with

For the original Strassen algorithm, the bounds are normwise and of the form

for some f_n that has a larger constant factor but is not really dramatically more quickly growing than a normwise bound would be for conventional multiplication, especially if recursion isnâ€™t going too deep before you apply conventional multiplication. The Winograd variant with fewer additions doesnâ€™t necessarily satisfy the normwise bound unless you introduce some scaling. Unless you are specifically worried about componentwise errors, this is probably still fine for many uses. There is a very nice chapter in N. Highamâ€™s â€śAccuracy and Stability of Numerical Algorithmsâ€ť that covers the stability of these algorithms, which is where I pulled this from.

Despite the name, at a quick look that implementation seems to be the Winograd form of Strassenâ€™s method, not the Coppersmith-Winograd O(n^{2.3755}) algorithm from 1990, which is significantly more complex.

Good point. So I guess it was mislabeled and I fell for it. I did look briefly at the implementation and was surprised it didnâ€™t look more complicated. I was also pondering the point made by @GunnarFarneback that the original paper looks more like an existence proof than a full specification. So now Iâ€™m circling back to wondering if the algorithm has even been fully described anywhere. I suppose that if you go much past Strassenâ€™s algorithm an existence proof is as useful as a fully optimized implementation.

I also do not know all the details (and those I know are very rusty), but from what I understand it is very hard to turn the mathematics into an actual implementation of CW or any later algorithm. The proof is indeed an existence proof, proving with probabilistic methods the existence of a sequence of algorithms (for increasing sizes n) that attain the desired complexity.

I wanted to add it here for completeness.

```
â€śGalacticâ€ť algorithms, a term coined by Regan, are algorithms
with good-looking asymptotic running times but concrete costs
so high as to prevent their practical use on scales smaller than
the universe.
```