I haven’t analyzed it carefully yet, but I have played with it a little bit. My impressions so far are:
This newly found algo is for complex matrices, because the linear combinations of the factors are complex. It is not yet clear whether there exists a specific optimization for real matrices (probably not).
If this algo is used on real matrices, the asymptotic complexity is still N^{2.79248} (\log_4 48 \approx 2.79248), a slight but very important improvement over the Strassen algo’s N^{2.80735} (\log_2 7 \approx 2.80735), despite the prefactor of 4 introduced by the complex nature of the new algo.
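For reference, these exponents are just \log_4 48 and \log_2 7, which can be checked in the REPL:

julia> log(4, 48)  # ≈ 2.79248, exponent for the new 4x4 scheme with 48 block multiplications
julia> log(2, 7)   # ≈ 2.80735, exponent for Strassen with 7 block multiplications per 2x2 partition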
All the “cheap multiplications”, in particular multiplications by the complex numbers {±i/2, ±(1±i)/2}, can actually be replaced by additions and subtractions, so this new algo has quite promising applications for real-world matrix multiplication.
Can you explain? How do you get around the multiplication by \frac{1}{2}? (Either way, it’s unclear to me how this makes any difference for how useful the algorithm is. With or without these inactive multiplications, the algorithm is only useful in the regime where their cost is negligible compared to the active multiplications.)
I can’t get around the 1/2 entirely, but: 1) for each matrix element there are three factors of 1/2 in total, which can be combined into a single factor of 1/8; 2) multiplying by 1/8 can be done by just shifting the exponent part, without doing a floating-point multiplication.
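For instance (just to illustrate), the three halvings fold into a single scaling by 1/8, which only touches the exponent; ldexp here stands in for “shift the exponent by -3”:

julia> x = 3.7;

julia> x / 2 / 2 / 2 == x / 8 == ldexp(x, -3)
true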
Multiplying complex numbers by (1±i) can be done by adding or subtracting their real and imaginary parts. But anyway, you are right: the usefulness of the new algo ultimately depends on the extent to which you can optimize the inactive multiplications.
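For example, since (1+i)(a+bi) = (a-b) + (a+b)i, the scaling by 1+i costs one addition and one subtraction and no real multiplications:

julia> z = 1.25 - 2.0im;

julia> (1 + im) * z == (real(z) - imag(z)) + (real(z) + imag(z)) * im
true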
Is there a way to actually make this faster than the multiplication? Single-instruction operations are hard to beat, and ldexp is certainly not a winner for performance, but perhaps there are better implementations?
julia> @btime b .= ldexp.(a, -3) setup=(a = rand(256); b = zero(a));
127.737 ns (0 allocations: 0 bytes)
julia> @btime b .= a ./ 8 setup=(a = rand(256); b = zero(a));
17.368 ns (0 allocations: 0 bytes)
No, my point was the opposite: the inactive multiplications are irrelevant, because an algorithm like this can only hope to be useful in situations where the active multiplications are orders of magnitude more expensive than everything else. Note that on modern CPUs, there’s no meaningful performance difference between adding floats and multiplying floats, so inactive multiplications are just as problematic, or just as irrelevant, as inactive additions and subtractions, which for this algorithm number in the hundreds regardless of how you optimize it.
(Sorry if this was confusing—when I was talking about minimizing the number of inactive multiplications, that was my attempt at steelmanning their “4x4 matrix multiplication in only 48 scalar multiplications” quote, and my conclusion was that it’s a misleading claim. When talking about the algorithms’ actual usefulness, however, we shouldn’t worry about inactive multiplications any more than addition and subtraction.)
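(If you want to verify the add-vs-multiply point on your own machine, a micro-benchmark in the same style as the one above will do; I won’t quote numbers here since they vary by CPU:)

julia> using BenchmarkTools

julia> @btime c .= a .+ b setup=(a = rand(256); b = rand(256); c = zeros(256));

julia> @btime c .= a .* b setup=(a = rand(256); b = rand(256); c = zeros(256));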
Actually, I don’t think the optimization of the 1/8 multiplications can be done in Julia. This kind of optimization likely has to be done at a lower level, like LLVM, native code, or even hardware.
For the usefulness of this kind of algorithm, the key point is that it is not used to simply multiply two 4x4 matrices. Instead, its use cases are when the elements of the 4x4 matrices are themselves matrices, i.e. block matrices, just like how you use the Strassen algo: multiplying two very large matrices by recursively partitioning them as 2x2 block matrices and applying Strassen. In this case, as I stated in a previous post, the “active multiplications” are expensive, as their complexity is N^3 (with the naive method), while the “inactive multiplications” are cheap, as their complexity is N^2. That is the complexity hierarchy, which, with recursion, reduces the asymptotic complexity of Strassen to N^{2.80735}, regardless of how you optimize the “inactive multiplications”. The optimization of the inactive multiplications only affects the matrix sizes down to which Strassen is still faster than the naive method, i.e. the proper endpoint of the recursion.

The new 4x4 algo is similar: due to the complexity hierarchy, for large enough matrices, the new algo is always faster than the naive method (and also Strassen) by partitioning. But if you do not optimize the inactive multiplications sufficiently, the threshold matrix size, below which applying the new algo is no longer beneficial, becomes so large (like 10000x10000) that the new algo can hardly be used in real life.
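To make the recursion concrete, here is a rough, unoptimized sketch of such a recursive Strassen (my own illustration, for square matrices whose size is a power of two; the name strassen and the cutoff threshold are just placeholders):

# Recursive Strassen for N x N matrices with N a power of two.
# Below `threshold` it falls back to the built-in multiplication.
function strassen(A::AbstractMatrix, B::AbstractMatrix; threshold::Int = 64)
    n = size(A, 1)
    @assert size(A) == size(B) == (n, n) && ispow2(n)
    n <= threshold && return A * B
    h = n ÷ 2
    A11 = A[1:h, 1:h];    A12 = A[1:h, h+1:n]
    A21 = A[h+1:n, 1:h];  A22 = A[h+1:n, h+1:n]
    B11 = B[1:h, 1:h];    B12 = B[1:h, h+1:n]
    B21 = B[h+1:n, 1:h];  B22 = B[h+1:n, h+1:n]
    # The 7 "active" multiplications are the recursive calls ...
    M1 = strassen(A11 + A22, B11 + B22; threshold)
    M2 = strassen(A21 + A22, B11;       threshold)
    M3 = strassen(A11,       B12 - B22; threshold)
    M4 = strassen(A22,       B21 - B11; threshold)
    M5 = strassen(A11 + A12, B22;       threshold)
    M6 = strassen(A21 - A11, B11 + B12; threshold)
    M7 = strassen(A12 - A22, B21 + B22; threshold)
    # ... while the "inactive" additions/subtractions stay at O(N^2) per level.
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return [C11 C12; C21 C22]
end

For power-of-two sizes, strassen(A, B) should agree with A * B up to floating-point rounding; the only point here is to show where the 7 active multiplications and the O(N^2) inactive additions live.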
If you want to see working code for (recursive) Strassen in Julia, you can refer to this post:
(I think it can be done with reinterpret, bit mask + addition/subtraction (of -3, I guess?), then reinterpret back, but I would expect the compiler / LLVM to already exploit the underlying trick if it’s worth it.)
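Roughly this kind of thing (just a sketch; it silently mishandles zeros, subnormals, Inf and NaN):

julia> div8_bits(x::Float64) = reinterpret(Float64, reinterpret(UInt64, x) - (UInt64(3) << 52));

julia> div8_bits(3.7) == 3.7 / 8
true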
What you’re describing is the ldexp function. But ldexp has really weak performance because it has to make sure it isn’t incorrectly applied when any Inf, NaN, subnormal, or zero is the input or output (cases where the described exponent-mangling is wrong). And even if you skip those performance-killing checks (you can’t in general), it is still unlikely to be faster than multiplying by a float.
If anything, I’ve experimented with trying to make ldexp faster by going the other way – switching the exponent mangling to multiplication. But that takes some effort too (I never really achieved a meaningful difference) because ldexp supports larger exponent changes than multiplication (you need to do it in multiple steps for large exponent changes like ldexp(1e308, -2080)).
The only time I’ve personally managed to outperform native float arithmetic with bit-mangling is when inverting a power of two that I already know to be in the range of exponent-mangleable inverses, i.e. inverting a Float64 of the form 2^N with N \in \{-1022, \dots, +1022\}.
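A sketch of what that kind of bit-mangled inverse looks like (only valid when x is already known to be exactly 2^N with N in that range):

julia> inv_pow2(x::Float64) = reinterpret(Float64, 0x7fe0000000000000 - reinterpret(UInt64, x));

julia> inv_pow2(1024.0) == 1 / 1024
true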
Sure, but this applies equally to the inactive additions and subtractions. We’ve already established that it’s trivial to get down to just 16 inactive multiplications (one scaling by \frac{1}{8} for each block element of the output), and there are hundreds of additions and subtractions, so I don’t think the inactive multiplications are the limiting factor.