# Row and column major order for arrays of different shape

In the following code, the functions r_major! and c_major! iterate over the array a in row-major and column-major order, respectively:

using BenchmarkTools

function r_major!(a)
    N1, N2 = size(a)
    for i = 1:N1
        for j = 1:N2
            a[i, j] = 2
        end
    end
    return nothing
end

function c_major!(a)
    N1, N2 = size(a)
    for j = 1:N2
        for i = 1:N1
            a[i, j] = 2
        end
    end
    return nothing
end

N1, N2 = 400, 400
# N1, N2 = 80000, 2
# N1, N2 = 2, 80000
a = zeros((N1, N2))

@show N1, N2, N1 * N2
@btime r_major!($a)
@btime c_major!($a)

Depending on the shape of a, I obtain the following results:

(N1, N2, N1 * N2) = (400, 400, 160000)
238.082 μs (0 allocations: 0 bytes)
34.344 μs (0 allocations: 0 bytes)

(N1, N2, N1 * N2) = (80000, 2, 160000)
128.841 μs (0 allocations: 0 bytes)
34.441 μs (0 allocations: 0 bytes)

(N1, N2, N1 * N2) = (2, 80000, 160000)
71.994 μs (0 allocations: 0 bytes)
100.357 μs (0 allocations: 0 bytes)

As expected, with an array a of size 400x400 the loop in column-major order is much faster. With size 80000x2 (the total number of elements is the same), column-major order is still faster, but the time of the row-major loop drops by a factor of two, which is strange. However, when the size is 2x80000, the column-major loop becomes slower than the row-major loop. How is this possible if the number of elements does not change and only the shape of the array changes?

Your inner loops are iterating over rows first in c_major! and columns in r_major!

So c_major! is cache missing

Are there any strategies to avoid cache misses?
I have a multidimensional array with runtime-defined dimensions. I know that the first dimensions are much smaller than the last one. So far I follow the "performance tips" and organize loops in column-major order, which is not always optimal due to cache misses. Will it always be better to move the longest dimension to be the first one? Would it help to create a long array of small StaticArrays?

Accessing memory using contiguous blocks is the way to avoid cache misses.

Youāve seen with your code what the effect of flushing the cache is on performance.

Hmm, reading this my intuition tingles that I have it the wrong way around

https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-column-major

Remembering the ordering of arrays can have significant performance effects when looping over arrays. A rule of thumb to keep in mind is that with column-major arrays, the first index changes most rapidly. Essentially this means that looping will be faster if the inner-most loop index is the first to appear in a slice expression. Keep in mind that indexing an array with : is an implicit loop that iteratively accesses all elements within a particular dimension; it can be faster to extract columns than rows, for example.
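As a quick Base-only illustration of what "the first index changes most rapidly" means for the memory layout (a sketch added for clarity, not from the original thread):

```julia
# Julia arrays are column-major: consecutive memory locations walk
# down a column before moving on to the next column.
A = reshape(collect(1:6), 2, 3)  # 2×3 matrix filled in memory order

# vec(A) exposes the underlying linear (memory) order.
@show vec(A)                      # [1, 2, 3, 4, 5, 6]

# The first (row) index varies fastest along memory:
@show A[1, 1], A[2, 1], A[1, 2]   # (1, 2, 3)
```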

Yeah, this is exactly the part from the "performance tips" which I always had in mind when writing loops. However, it seems that there are some corner cases.

Note that, by far the fastest way is to use eachindex:

function rowmajor!(A)
    for i in axes(A, 1), j in axes(A, 2)
        @inbounds A[i, j] = 2.0
    end
    return nothing
end

function colmajor!(A)
    for j in axes(A, 2), i in axes(A, 1)
        @inbounds A[i, j] = 2.0
    end
    return nothing
end

function linear!(A)
    for i in eachindex(A)
        @inbounds A[i] = 2.0
    end
    return nothing
end

Benchmark:

1.7.2> @btime rowmajor!(A) setup=(A=rand(2, 8000));
4.950 μs (0 allocations: 0 bytes)

1.7.2> @btime colmajor!(A) setup=(A=rand(2, 8000));
8.050 μs (0 allocations: 0 bytes)

1.7.2> @btime linear!(A) setup=(A=rand(2, 8000));
1.650 μs (0 allocations: 0 bytes)

My (somewhat uninformed) guess is that for wide and low matrices, you need to do a lot of divrem calculations to find the correct memory locations for colmajor!, while rowmajor! can just use a fixed stride, and with just 2 rows you don't waste that much cache.

eachindex just uses linear indices and optimal cache use.
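To make that concrete, here is a minimal Base-only sketch (the name fill_linear! is mine, not from the thread) showing that eachindex on a plain Array yields linear indices that traverse memory contiguously:

```julia
function fill_linear!(A, v)
    # eachindex on an Array gives linear indices 1:length(A), which
    # visit elements in contiguous (column-major) memory order, so the
    # loop needs no multi-dimensional index arithmetic at all.
    for i in eachindex(A)
        A[i] = v
    end
    return A
end

A = zeros(2, 5)
fill_linear!(A, 2.0)
@show all(==(2.0), A)   # true
```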


I donāt really know this very well, but had a look at the native code which contained a vbroadcastesd command, and looking this up it seems to

Broadcast double-precision floating-point element in mem to four locations in ymm1.

And testing with N1=4,N2=40000 we still have column major faster, so maybe this effect is from this specific broadcast command which I assume happens in the inner loop, canāt be fully utilized when the loop is less than 4 iterations.

For the other cases my guess is:
Case 1
Row-major will iterate over non-contiguous memory, 400 different instances of it, and thus not be able to keep all sub-blocks in cache, missing a lot.
Column-major is all fine here: it iterates over contiguous memory and the inner loop is long enough.
Case 2
Row-major will iterate over only two different columns, and if two blocks are loaded, one for a part of each column, there won't be any cache misses until we reach the end of both blocks, when there will be two misses, and so on. So this might have far fewer cache misses than the first case (if my assumptions here are correct).
Column-major is still all fine: it iterates over contiguous memory and the inner loop is long enough.
Case 3
Row-major will now load a single contiguous block containing multiple consecutive columns, depending on block size. It will also not have as many cache misses, since columns are accessed in sequence. Now we also have a better inner loop than in case 2 for row-major, which might be why it is faster.
Column-major still has a good layout, but now we have a worse inner loop, since it only iterates over 2 elements and can't fully utilize the 4-float write.


Thank you all for the answers. I'll dig around a bit with different scenarios.
Just out of curiosity, the figure below shows the comparison of runtimes for different N1 values (N1xN2 is kept constant).


Oh, you had the same speed for N1 equal to 2 and 4? For me there was a factor of around 2 between them. Maybe we have different architectures and get different native code, or maybe I was wrong; as I said, I don't really know much about this.


But your conclusion is still correct: the column-major loop becomes faster (it does not matter that this is due to the slower row-major loop).

No divrem operations are required AFAIK. But if your inner loop has length 2 then you may pay a price for re-starting the inner loop so frequently (e.g. branch-prediction failures).

(Whereas eachindex uses linear indexing, so it is in memory-contiguous order and has minimal loop-indexing overhead.)

(The most common case for having lots of 2xN arrays seems to be people who really want an array of 2-component vectors, e.g. to represent points in 2d. In which case they should consider using StaticArrays, i.e. a Vector{SVector{2,T}} of length N. In this case the loops of length-2 will be statically unrolled.)
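A Base-only sketch of the same idea (with StaticArrays one would use a Vector{SVector{2,Float64}}; here reinterpret, which is plain Julia, shows that a 2×N column-major matrix and a length-N vector of 2-tuples share the same contiguous memory layout):

```julia
# Columns of a 2×N matrix are the 2-component points.
A = [1.0 3.0 5.0;
     2.0 4.0 6.0]

# Reinterpret the same memory as 2-tuples; no copy is made.
pts = reinterpret(NTuple{2,Float64}, vec(A))

@show pts[1]       # (1.0, 2.0) — the first column
@show length(pts)  # 3
```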


In my actual code I have a large number (N2) of rather small square matrices (of size N1 x N1). With each matrix, I perform the same set of operations. As a result, this code is a perfect candidate for parallel processing. In the corresponding parallel code, the outer loop must always run along N2. Therefore, I cannot use the eachindex approach. My task is to find the structure which results in the fastest loop. The possible candidates are:

a = zeros(N2, N1, N1)
b = zeros(N1, N2, N1)
c = zeros(N1, N1, N2)
d = [zeros(N1,N1) for k=1:N2]
e = [MMatrix{N1,N1}(zeros(N1,N1)) for k=1:N2]

that is, a three-dimensional array with different positions of the longest dimension N2, an array of matrices, and an array of matrices from the StaticArrays package.
The two possible loops are

# row-major
for k = 1:N2
    for n = 1:N1
        for m = 1:N1
            a[k, n, m] = 2
        end
    end
end

# column-major
for k = 1:N2
    for m = 1:N1
        for n = 1:N1
            a[k, n, m] = 2
        end
    end
end

(similar loops are used for arrays b, c, d, and e).
The figures below show the timings for the corresponding loops in case of N2 equal to either 500 or 1000 and N1 ranging from 2 to 10.

As you can see, up to N1=8 the optimal data structure is e (an array of static matrices) iterated in column-major order. Starting from N1=8 there is a huge speed degradation for e. Moreover, for N2=1000 (second figure) and N1=10, the row-major order is faster in the e case.

Surprisingly, at N1>7 the optimal structure for N2=500 is b (a 3d array with the longest dimension in the middle). For big matrices in the N2=1000 case, the best structure is c.

To sum up, it seems that there is no unique rule of thumb which will give the fastest results independently of the array size. Sad.

I think you should consider using LoopVectorization. The MWE

using StaticArrays
using BenchmarkTools
using LoopVectorization

function rowmajorarray(x)
    # row-major
    @turbo for k = 1:size(x, 1)
        for n = 1:size(x, 2)
            for m = 1:size(x, 3)
                x[k, n, m] = 2
            end
        end
    end
end
function rowmajorvector(x, s1, s2)
    # row-major
    for k = 1:size(x, 1)
        xk = x[k]
        @turbo for n = 1:s1
            for m = 1:s2
                xk[n, m] = 2
            end
        end
    end
end

function columnmajorarray(x)
    # column-major
    @turbo for k = 1:size(x, 1)
        for m = 1:size(x, 3)
            for n = 1:size(x, 2)
                x[k, n, m] = 2
            end
        end
    end
end
function columnmajorvector(x, s1, s2)
    # column-major
    for k = 1:size(x, 1)
        xk = x[k]
        @turbo for m = 1:s1
            for n = 1:s2
                xk[n, m] = 2
            end
        end
    end
end

N1 = 10
N2 = 1000
a = zeros(N2, N1, N1)
b = zeros(N1, N2, N1)
c = zeros(N1, N1, N2)
d = [zeros(N1, N1) for k = 1:N2]
e = [MMatrix{N1,N1}(zeros(N1, N1)) for k = 1:N2]
@btime rowmajorarray($a)
@btime rowmajorarray($b)
@btime rowmajorarray($c)
@btime rowmajorvector($d, N1, N1)
@btime rowmajorvector($e, N1, N1)
@btime columnmajorarray($a)
@btime columnmajorarray($b)
@btime columnmajorarray($c)
@btime columnmajorvector($d, N1, N1)
@btime columnmajorvector($e, N1, N1)

yields

22.700 μs (0 allocations: 0 bytes)
22.600 μs (0 allocations: 0 bytes)
24.700 μs (0 allocations: 0 bytes)
25.600 μs (0 allocations: 0 bytes)
24.800 μs (0 allocations: 0 bytes)
22.400 μs (0 allocations: 0 bytes)
22.800 μs (0 allocations: 0 bytes)
24.900 μs (0 allocations: 0 bytes)
26.300 μs (0 allocations: 0 bytes)
24.500 μs (0 allocations: 0 bytes)

for your different data structures and access methods, which is anywhere from 2x to 8x faster on my system.


Thank you for the example.
Do you know if LoopVectorization will work inside CUDA kernels?

Maybe you have simplified the example so much that it is possible to misunderstand. I mean, why not

a = fill(2.0, N2, N1, N1)

?

Can you give a somewhat more realistic example for the values in your arrays? Where do you get the values from, and why must you fill the arrays in such an awkward fashion?

No, AFAIK. Tullio is an option that seems to support CPU and GPU but has a rather specific API.

I suspect that LoopVectorization is not the right end to start at here, and that there is some wasted performance you can fix by improving the basics. But it's still a bit unclear what you want inside your arrays.

Just as an example, you can make an array of SMatrixes like this:

x = [@SMatrix fill(2.0, N1, N1) for _ in 1:N2]

And then you can update it like this:

x .= Ref(@SMatrix fill(3.0, N1, N1))

For small dimensions N1 this could be really fast, but for larger values the advantage probably disappears. Using SMatrix could also be very beneficial for the rest of your code.

Just FYI, this:

creates a regular matrix and then converts it to an MMatrix. That wastes performance. Better to directly create an MMatrix (or SMatrix), using one of the macros:

e = [@MMatrix zeros(N1, N1) for k=1:N2]

Furthermore, to get the wanted performance, N1 must be a compile-time constant. If it's a runtime value, it will not work well.
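One common Base-only pattern for turning a runtime size into a compile-time constant is to lift it into the type domain with Val, paying a single dynamic dispatch at the boundary (a sketch; the name kernel is hypothetical):

```julia
# Inside kernel, N is a type parameter, i.e. a compile-time constant,
# so expressions like N * N can be constant-folded and short loops
# over 1:N can be fully unrolled by the compiler.
kernel(::Val{N}) where {N} = N * N

n1 = 3                 # runtime value
@show kernel(Val(n1))  # one dynamic dispatch here, then specialized code
```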


I have a big system of differential equations and I need to fill in its left-hand side (the derivatives). The right-hand side involves other arrays of matrices of the same shape. Here I'm just trying to figure out whether there is a universal solution that doesn't depend on the size of the array.

The problem is that your example just fills the arrays with the same value. So then you can definitely use eachindex. It would be easier to help if we could see why eachindex isn't possible, and maybe find some way around it.

I suspect that you have oversimplified your MWE.

You have a point:

@btime fill!($a, 2)
@btime fill!($b, 2)
@btime fill!($c, 2)
@btime foreach(dk -> fill!(dk, 2), $d)
@btime foreach(ek -> fill!(ek, 2), $e)

yields

17.300 μs (0 allocations: 0 bytes)
17.300 μs (0 allocations: 0 bytes)
16.800 μs (0 allocations: 0 bytes)
18.500 μs (0 allocations: 0 bytes)
21.600 μs (0 allocations: 0 bytes)