What are the pros and cons of row/column major ordering?

That’s true for matrix–vector multiplication, but for matrix–matrix multiplication, both are cache-unfriendly.
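To make the matrix–vector case concrete, here is a minimal Python sketch (pure lists, no NumPy; the function names are my own) of the two classic loop orders. The "dot product" order scans $A$ row by row, which is contiguous for row-major storage, while the "axpy" order scans it column by column, which is contiguous for column-major storage — so for matrix–vector multiplication you can always match the loop order to the layout:

```python
def matvec_dot(A, x):
    """Row-by-row ("dot product") order: the inner loop scans a row of A,
    which is a contiguous sweep if A is stored row-major."""
    n = len(A)
    y = [0.0] * n
    for i in range(n):
        for j in range(n):
            y[i] += A[i][j] * x[j]
    return y

def matvec_axpy(A, x):
    """Column-by-column ("axpy") order: the inner loop scans a column of A,
    which is a contiguous sweep if A is stored column-major."""
    n = len(A)
    y = [0.0] * n
    for j in range(n):
        for i in range(n):
            y[i] += A[i][j] * x[j]
    return y
```

Both orders compute the same $y = Ax$; only the memory-access pattern differs.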

Above, you are only talking about spatial locality (consecutive access), but to optimize matrix–matrix multiplication for cache you mainly need to think about temporal locality (re-using a number multiple times once it is in cache). The problem with either rows-times-columns or columns-times-rows is that each does $O(1)$ work per cache miss, regardless of the storage format.
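The rows-times-columns version is just the naive triple loop — a quick sketch (plain Python, illustrative only) makes the $O(1)$-work-per-miss problem visible: the innermost statement does two flops per pair of elements it touches, and once $n$ is too large for the cache, each sweep over $A$ and $B$ re-misses on data that was evicted since the last pass, no matter how you permute the three loops:

```python
def matmul_naive(A, B):
    """Rows-times-columns triple loop for C = A*B.  The inner statement does
    O(1) flops per element of A and B it reads, so for matrices larger than
    the cache the work per cache miss is O(1) in any loop order."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]  # one multiply-add per element pair
            C[i][j] = s
    return C
```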

To optimize temporal locality, you need to multiply the matrices by submatrix blocks, roughly $\sqrt{Z} \times \sqrt{Z}$ blocks for a cache that holds $Z$ elements. That way, you load a block into cache ($Z$ cache misses) and then do $O(Z^{3/2})$ work for the block multiplication. This does $O(\sqrt{Z})$ work per cache miss, which allows you to completely mask the memory latency and (roughly) hit the peak flop rate of the CPU.
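A blocked multiply can be sketched like this (again plain Python with a block size `b` chosen by the caller; a real implementation would pick `b` so that three $b \times b$ blocks fit in cache, i.e. $b \sim \sqrt{Z/3}$, and would vectorize the micro-kernel):

```python
def matmul_blocked(A, B, b):
    """Blocked C = A*B: each pair of b-by-b blocks is brought into cache
    once (O(b^2) misses) and reused for O(b^3) flops, giving O(b) work
    per cache miss -- O(sqrt(Z)) when b is sized to the cache."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):          # block row of C
        for jj in range(0, n, b):      # block column of C
            for kk in range(0, n, b):  # block multiply-accumulate
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        a = A[i][k]
                        Bk = B[k]
                        Ci = C[i]
                        for j in range(jj, min(jj + b, n)):
                            Ci[j] += a * Bk[j]
    return C
```

The three outer loops walk over blocks; the three inner loops are the in-cache micro-kernel, so all the data reuse happens while the blocks are resident.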

See also the analysis in this paper, or this Julia notebook from my course notes with some experiments. The MIT course 6.172 has two free lecture videos (first and second) on cache-efficient algorithms, including a discussion of matrix multiplication.

The upshot is that, for highly optimized matrix-multiplication algorithms, there probably isn’t much difference between multiplying two row-major vs. two column-major matrices, or either one by a vector.

See the discussion: Why column major?
