Specialized matrix-matrix multiplication algorithm

See the following discussion on what’s involved in implementing a fast matrix–matrix multiplication routine in pure Julia code (or any compiled language, for that matter), along with some example code: Julia matrix-multiplication performance

Also these course notes: 18335/notes/Memory-and-Matrices-latest.ipynb at spring21 · mitmath/18335 · GitHub

(A fundamental problem with implementing matrix–matrix multiplication as a sequence of matrix–vector products is that it has poor temporal memory locality: each matrix–vector product reads the whole left-hand matrix once, so unless that matrix fits in cache it is re-fetched from main memory for every column of the result. You won't take full advantage of the caches and you will end up being memory-bound. To do better, you have to divide the matrices into submatrix blocks rather than simply into columns, so that each block is reused many times while it is still in cache; there are a variety of strategies for this, as discussed in the links above. But getting the last factor of 2–3 in performance, not including multi-threading, is pretty hard; you have to heavily optimize the low-level kernels.)
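To make the blocking idea concrete, here is a minimal sketch of a cache-blocked multiply. It is not the code from the linked thread: the name `matmul_blocked!` and the default block size `bs = 64` are placeholders chosen for illustration, and a good block size depends on your cache sizes and element type.

```julia
using LinearAlgebra  # only needed for the reference check C ≈ A * B below

# Hypothetical illustration: compute C = A * B one bs-by-bs block at a time,
# so each block of A, B, and C is reused while it is still in cache.
function matmul_blocked!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix;
                         bs::Int = 64)  # bs = 64 is an assumed, untuned block size
    m, n = size(A)
    n2, p = size(B)
    n == n2 || throw(DimensionMismatch("inner dimensions of A and B differ"))
    size(C) == (m, p) || throw(DimensionMismatch("C has the wrong size"))
    bs > 0 || throw(ArgumentError("block size must be positive"))

    fill!(C, zero(eltype(C)))
    # Outer loops walk over blocks; min() handles ragged edge blocks.
    for jj in 1:bs:p, kk in 1:bs:n, ii in 1:bs:m
        jmax = min(jj + bs - 1, p)
        kmax = min(kk + bs - 1, n)
        imax = min(ii + bs - 1, m)
        # Inner loops multiply one block of A by one block of B.
        # i is innermost because Julia arrays are column-major.
        @inbounds for j in jj:jmax, k in kk:kmax
            b = B[k, j]
            for i in ii:imax
                C[i, j] += A[i, k] * b
            end
        end
    end
    return C
end

# Usage: compare against the built-in (BLAS-backed) product.
A = rand(100, 70)
B = rand(70, 45)
C = zeros(100, 45)
matmul_blocked!(C, A, B; bs = 16)
println(C ≈ A * B)
```

This only fixes the loop ordering and cache reuse. It has none of the register blocking, SIMD tuning, or packing that the optimized kernels use, so expect it to remain well short of BLAS.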
