What are the pros and cons of row/column major ordering?

I understand that because matrices in Julia are stored in column major order, it is best to iterate over them in column major order.
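
For concreteness, here is a small example of the column-major layout I mean (the entries are chosen so that each value encodes its own row and column):

A = [11 12 13;
     21 22 23;
     31 32 33]

vec(A)  # storage order: 11, 21, 31, 12, 22, 32, 13, 23, 33 (the columns are contiguous)

A[2, 3] == vec(A)[2 + (3 - 1) * 3]  # true: A[i,j] lives at linear index i + (j-1)*m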

What is less clear to me is why one would want to store matrices in column major order in the first place.

Consider a matrix-vector multiplication

Ax = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} x_{11} \\ x_{21} \\ x_{31} \end{pmatrix} = \begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \end{pmatrix}.

To compute the first entry of b, we take the dot product of the first row of A with x. When A is stored in row-major order, the first row of A is contiguous in memory. The column vector x is contiguous either way, since it has only one entry per row. But if we store A in column-major order, then the first row of A is strided, and only x is contiguous in memory.

Doesn’t this mean that column-major ordering makes matrix multiplication less cache-friendly? Matrix multiplication is, I would think, the most common operation performed on matrices, so we would want to choose the storage order that is best for it.

1 Like

This has been discussed extensively, as you can imagine, and someone may be able to link to older discussions. But the bottom line is that one or the other has to be chosen.
Physicists might call it symmetry breaking: once you choose one option, efficiency demands that the choice be propagated consistently to many other functions.

5 Likes

If you use the dot-product formulation, that is true. But matrix-vector multiplication can also be viewed as forming a linear combination of the columns of A, weighted by the elements of x. In fact, I think most introductory linear algebra texts start with the linear-combination definition. In code, the two variants might look like this:

function mul_dot(A, x)
  m, n = size(A)
  result = zeros(eltype(x), m)
  for j in 1:m      # one dot product per row of A
    for k in 1:n    # inner loop walks along row j: contiguous for row-major storage
      result[j] += A[j,k] * x[k]
    end
  end
  return result
end

function mul_comb(A, x)
  m, n = size(A)
  result = zeros(eltype(x), m)
  for k in 1:n      # accumulate x[k] times column k of A
    for j in 1:m    # inner loop walks down column k: contiguous for column-major storage
      result[j] += A[j,k] * x[k]
    end
  end
  return result
end

The first one is cache-friendly for row-major order and the second is cache-friendly for column-major order. And if you need matrix-matrix multiplication, you need blocking to be cache-friendly, which is what BLAS does.
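
For completeness, here is a rough (and deliberately unoptimized) sketch of the blocking idea; the block size bs = 64 is an arbitrary illustrative choice, and real BLAS implementations add packing, SIMD micro-kernels, and several levels of blocking on top of this:

function mul_blocked!(C, A, B; bs = 64)
  m, n = size(A)
  n2, p = size(B)
  @assert n == n2 && size(C) == (m, p)
  fill!(C, zero(eltype(C)))
  # visit the matrices block by block, so one block each of A, B, C is hot in cache at a time
  for jj in 1:bs:p, kk in 1:bs:n, ii in 1:bs:m
    # multiply the current pair of blocks and accumulate into the corresponding block of C
    for j in jj:min(jj + bs - 1, p), k in kk:min(kk + bs - 1, n), i in ii:min(ii + bs - 1, m)
      C[i, j] += A[i, k] * B[k, j]
    end
  end
  return C
end

Each entry loaded for a block is reused about bs times before it leaves the cache, which is where the win over the plain triple loop comes from.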

7 Likes

That’s true for matrix–vector multiplication, but for matrix–matrix multiplication, both are cache-unfriendly.

Above, you are only talking about spatial locality (consecutive access), but to optimize matrix–matrix multiplication for cache you mainly need to think about temporal locality (re-using a number multiple times once it is in cache). The problem with either rows-times-columns or columns-times-rows is that it does O(1) work per cache miss, regardless of the storage format.

To optimize temporal locality, you need to multiply the matrices by submatrix blocks, roughly \sim \sqrt{Z} \times \sqrt{Z} blocks for a cache that holds Z elements. That way, you load a block into cache (Z cache misses) and then do O(Z^{3/2}) work for the block multiplication. This does O(\sqrt{Z}) work per cache miss, which allows you to completely mask the memory latency and (roughly) hit the peak flop rate of the CPU.
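
To make that count explicit (a rough estimate, assuming an ideal cache of Z elements and square b \times b blocks with b \approx \sqrt{Z/3}, so that one block of each of the three matrices fits in cache at once): there are (m/b)^3 block multiplications, each incurring about 3b^2 misses to bring its blocks into cache, for a total of roughly 3b^2 (m/b)^3 = 3m^3/b = \Theta(m^3 / \sqrt{Z}) misses, versus on the order of one miss per multiply-add (up to the cache-line length) for the naive loop orderings.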

See also the analysis in this paper, or this Julia notebook from my course notes with some experiments. The MIT course 6.172 has two free lecture videos (first and second) on cache-efficient algorithms, including a discussion of matrix multiplication.

The upshot is that, for highly optimized matrix-multiplication algorithms, there probably isn’t much difference between multiplying two row-major vs. two column-major matrices, or either one by a vector.

See the discussion: Why column major?

21 Likes

Are we discussing matrix-vector or matrix-matrix multiplication? I tried to cover matrix-matrix with a mention of blocking, and didn’t intend to give the impression that what I wrote could be scaled up to matrix-matrix multiplication in a cache-friendly way. But perhaps I took too narrow a view of the question.

3 Likes

Sorry, I realized I had accidentally broadened the discussion. I edited my post to clarify.

You’re right that for matrix–vector multiplication, all that matters (for the cache) is spatial locality; temporal locality is only possible (and hence important) for the matrix–matrix case.

In both cases, however, there doesn’t seem to be any major advantage to either row- or column-major for optimized algorithms.

With dense matrices, the time is usually dominated by \Theta(m^3) operations like matrix–matrix multiplication, LU factorization, etcetera. (These are the algorithms where temporal locality is important.) That’s why I jumped to those algorithms.

In contrast, \Theta(m^2) operations like matrix–vector operations are much less important (but still have no inherent disadvantage for column-major AFAIK).

5 Likes

In fact, we can directly benchmark matrix–vector operations for both row-major and column-major in Julia. The reason is that transpose operations in Julia are “lazy” — they just re-interpret the array indices, and so effectively change the storage format from column-major to row-major.

Under the hood, multiplying transpose(A) * x calls an optimized BLAS routine that works with this data format. If we benchmark, we can see that there is negligible advantage in speed even for the highly optimized implementation:

julia> using LinearAlgebra, BenchmarkTools

julia> LinearAlgebra.BLAS.set_num_threads(1) # single-threaded benchmarks are easier to understand

julia> A = rand(1000,1000); x = rand(1000); y = similar(x);

julia> @btime mul!($y, $A, $x); # column-major matrix*vector
  156.481 μs (0 allocations: 0 bytes)

julia> @btime mul!($y, $(transpose(A)), $x); # row-major matrix*vector
  151.427 μs (0 allocations: 0 bytes)

(The difference is about 3%.)
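
To see the laziness directly (a small check, separate from the benchmark):

B = transpose(A)    # a lazy Transpose wrapper; no data is copied
parent(B) === A     # true: B just re-indexes the same storage
B[2, 1] == A[1, 2]  # true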

7 Likes

Things are a bit more complicated.

julia> using LoopVectorization, BenchmarkTools

julia> A = rand(32,32); x = rand(32); y = similar(x);

julia> function mul_dot!(result, A,x)
         m, n = size(A)
         @turbo for j in 1:m
           for k in 1:n
             result[j] += A[j,k] * x[k]
           end
         end
         return result
       end
mul_dot! (generic function with 1 method)

julia> @btime mul_dot!($y,$A,$x); # column major
  52.329 ns (0 allocations: 0 bytes)

julia> @btime mul_dot!($y,$A',$x); # row major
  78.010 ns (0 allocations: 0 bytes)

LoopVectorization can reorder loops, but it chooses the “dot” order regardless of how the loops were originally written.

EDIT:
Note that matrix-vector multiply is dominated by the memory-traversal cost of the matrix. mul!(y, A, x) is about as fast as sum(A).
Single-threaded MKL is faster than @turbo here for the 1000×1000 example above on my computer:

julia> using LinearAlgebra, BenchmarkTools

julia> LinearAlgebra.BLAS.set_num_threads(1) # single-threaded benchmarks are easier to understand

julia> A = rand(1000,1000); x = rand(1000); y = similar(x);

julia> @btime mul!($y, $A, $x); # column-major matrix*vector
  239.944 μs (0 allocations: 0 bytes)

julia> @btime mul!($y, $(transpose(A)), $x); # row-major matrix*vector
  238.171 μs (0 allocations: 0 bytes)

julia> @btime mul_dot!($y,$A,$x);
  268.371 μs (0 allocations: 0 bytes)

julia> @btime mul_dot!($y,$(transpose(A)),$x);
  272.558 μs (0 allocations: 0 bytes)

julia> @btime sum($A)
  273.718 μs (0 allocations: 0 bytes)
500124.58527710865
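
As a rough sanity check of the memory-bound claim (using the timings above): the matrix occupies 1000 × 1000 × 8 bytes = 8 MB, and 8 MB / 240 μs ≈ 33 GB/s, which is in the ballpark of what a single core can stream from main memory on typical hardware, so traversing A, not doing the arithmetic, sets the time.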

@turbo prefers column major, because then the vectorized loop doesn’t need to do horizontal reductions.

For matrix-matrix multiply, there is no difference between column and row major. C = A * B is the same as transpose(C) = transpose(B) * transpose(A), so you call the same function/low-level implementation and just swap the order of the arguments depending on whether your representation is row- or column-major.
A caveat here is that tall-and-skinny vs. short-and-wide matrices may have different performance, so highly non-square matrices may behave differently. In particular, a short and wide C is easier to parallelize with column major than with row major (and conversely, a tall and narrow C is easier for row major).
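
A quick way to check the identity itself (just an illustration of the math, not of how the low-level kernels are organized):

A = rand(300, 200); B = rand(200, 400)

C1 = A * B                                   # "column-major" product
C2 = transpose(transpose(B) * transpose(A))  # same result via swapped, transposed arguments

C1 ≈ C2  # true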

8 Likes

That makes a lot of sense. Thanks!

1 Like

From the paper:

These results require the tall-cache assumption (1) for matrices stored in row-major layout format, but the assumption can be relaxed for certain other layouts.

I’ve been thinking that some alternative layout, neither row- nor column-major, could be better, so it’s interesting to see that confirmed. [I suppose the analysis in the paper works for column-major too, no better or worse.]

We know sparse is better when it applies, and what I had in mind was a hybrid sparse-dense matrix: a sparse matrix of dense blocks, say 4×4 or 16×16.

But putting that aside, what other (purely dense) layouts might they have in mind? I’m not sure a Z-order curve is what they mean; I’m just showing it as an interesting alternative:

[Figure: calculating the Z-order curve (x, y)-coordinates from the interleaved bits of the z-value 2479.]

Mapping to (and from) those coordinates always looked too complex to me, but from the figure it’s actually rather neat, and it’s O(1) to do (a small sketch at the end of this post). Still, it seems to work against how Julia arrays work. If this (or some comparable) arrangement did work, it could also be a hybrid solution, i.e. Z-order only to locate the blocks, with the blocks themselves in the usual layout (sized for the L1 cache, I guess) and thus good for SIMD work.

The max array size for the example code is 65536*65536. Just use a power of 2 for ease, in that case the maximum wasted space is approx. 3/4

That’s a rather small limit, but a 65536×65536 block could be OK, with the Z-ordering applied only to the higher bits.
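
Just to show how simple the bit interleaving is, here is a minimal sketch (a plain-loop illustration; a real implementation would use bit tricks or the BMI2 pdep/pext instructions instead of a loop):

function morton(x::Integer, y::Integer)  # interleave the low 16 bits of x and y
  z = UInt64(0)
  for b in 0:15
    z |= ((UInt64(x) >> b) & 1) << (2b)      # bit b of x goes to even position 2b
    z |= ((UInt64(y) >> b) & 1) << (2b + 1)  # bit b of y goes to odd position 2b + 1
  end
  return z
end

morton(7, 6) == 0b00111101  # true: interleaving the bits of 7 (0111) and 6 (0110) gives 61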

GitHub - giaf/blasfeo (basic linear algebra subroutines for embedded optimization) defines a “panel major” format.

3 Likes