Actually, I don’t think the saving of 1/8 of the multiplications can be realized in Julia. This kind of optimization likely has to be done at a lower level, such as LLVM, native code, or even hardware.
As for the usefulness of such algorithms, the key point is that they are not meant to simply multiply two 4x4 matrices of scalars. Their use cases are when the elements of the 4x4 matrices are themselves matrices, i.e. block matrices, just like how the Strassen algorithm is used: you multiply two very large matrices by recursively partitioning them into 2x2 block matrices and applying Strassen. In this setting, as I stated in a previous post, the “active multiplications” are expensive, with complexity N^3 (for the naive method), while the “inactive multiplications” are cheap, with complexity N^2. That complexity hierarchy, combined with recursion, is what reduces the asymptotic complexity of Strassen to N^2.80735 (= N^(log2 7)), regardless of how you optimize the “inactive multiplications”. Optimizing the inactive multiplications only affects the matrix size down to which Strassen is still faster than the naive method, i.e. the proper endpoint of the recursion.

The new 4x4 algorithm is similar: due to the complexity hierarchy, for large enough matrices this new algorithm is always faster than the naive method (and also Strassen) by partitioning. But if you do not optimize the inactive multiplications sufficiently, the threshold matrix size, below which applying the new algorithm is no longer beneficial, becomes so large (like 10000x10000) that the new algorithm can hardly be used in real life.
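For concreteness, here is a minimal sketch of the recursive scheme in Julia, assuming square matrices whose size is a power of 2. The `threshold` keyword is the crossover point discussed above; its default value is illustrative, not tuned:

```julia
# Minimal recursive Strassen sketch (assumes n×n inputs with n a power of 2).
# Below `threshold`, fall back to the built-in (BLAS) product.
function strassen(A::AbstractMatrix, B::AbstractMatrix; threshold::Int=128)
    n = size(A, 1)
    n <= threshold && return A * B
    h = n ÷ 2
    A11 = A[1:h, 1:h];   A12 = A[1:h, h+1:n]
    A21 = A[h+1:n, 1:h]; A22 = A[h+1:n, h+1:n]
    B11 = B[1:h, 1:h];   B12 = B[1:h, h+1:n]
    B21 = B[h+1:n, 1:h]; B22 = B[h+1:n, h+1:n]
    # The 7 "active" multiplications recurse; all the additions around them
    # are the cheap O(N^2) part of the complexity hierarchy.
    M1 = strassen(A11 + A22, B11 + B22; threshold)
    M2 = strassen(A21 + A22, B11; threshold)
    M3 = strassen(A11, B12 - B22; threshold)
    M4 = strassen(A22, B21 - B11; threshold)
    M5 = strassen(A11 + A12, B22; threshold)
    M6 = strassen(A21 - A11, B11 + B12; threshold)
    M7 = strassen(A12 - A22, B21 + B22; threshold)
    C = similar(A)
    C[1:h, 1:h]     = M1 + M4 - M5 + M7
    C[1:h, h+1:n]   = M3 + M5
    C[h+1:n, 1:h]   = M2 + M4
    C[h+1:n, h+1:n] = M1 - M2 + M3 + M6
    return C
end
```

A quick check like `strassen(rand(256, 256), rand(256, 256)) ≈ A * B` should agree with the ordinary product up to floating-point error. In practice the threshold has to be fairly large before this beats the optimized BLAS product, which is exactly the crossover issue described above.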
If you want to see working code for (recursive) Strassen in Julia, you can refer to this post: