Why column major?

Thanks @lmiq. I’ll use your framework to attempt to answer my own question, though I still don’t understand some of the weirdness illustrated by OP.

using BenchmarkTools

"""
The innermost loop varies fastest, so it should run over the first index.
Remember: "run first over the first index."
Higher dimensions should appear last in the indexing expression and as the topmost (outermost) `for` loop.
"""
function update_fast1!(A)
    for k in axes(A,3)
        for j in axes(A,2)
            for i in axes(A,1)
                A[i,j,k] = A[i,j,k] - 1.0
            end
        end
    end
end

"Reverse iteration is slower."
function update_slow1!(A)
    for i in axes(A,1)
        for j in axes(A,2)
            for k in axes(A,3)
                A[i,j,k] = A[i,j,k] - 1.0
            end
        end
    end
end

"""
One-line nested `for` loops nest as though you are reading them from top to bottom:
the last iterator listed becomes the innermost loop and varies fastest.
"""
function update_fast2!(A)
    for k in axes(A,3), j in axes(A,2), i in axes(A,1)
        A[i,j,k] = A[i,j,k] - 1.0
    end
end

"Reverse iteration is slower."
function update_slow2!(A)
    for i in axes(A,1), j in axes(A,2), k in axes(A,3)
        A[i,j,k] = A[i,j,k] - 1.0
    end
end

"""
For some reason, inside a multidimensional array comprehension the reversed order is faster:
the first iterator listed becomes the innermost loop and varies fastest.
"""
function create_fast1(A)
    B = [A[i,j,k] + 1.0 for i in axes(A,1), j in axes(A,2), k in axes(A,3)]
end

"The order from `update_fast2!` is slow in this case."
function create_slow1(A)
    B = [A[i,j,k] + 1.0 for k in axes(A,3), j in axes(A,2), i in axes(A,1)]
end

"Here the original `update_fast2!` order is fast again, but the result is a `Vector` not a 3D `Array`."
function create_fast2(A)
    B = [A[i,j,k] + 1.0 for k in axes(A,3) for j in axes(A,2) for i in axes(A,1)]
end

"Reverse order is slower. Again result is a `Vector`."
function create_slow2(A)
    B = [A[i,j,k] + 1.0 for i in axes(A,1) for j in axes(A,2) for k in axes(A,3)]
end

julia> A = rand(500, 500, 500);

julia> @btime update_fast1!($A);
  68.297 ms (0 allocations: 0 bytes)

julia> @btime update_slow1!($A);
  3.781 s (0 allocations: 0 bytes)

julia> @btime update_fast2!($A);
  67.938 ms (0 allocations: 0 bytes)

julia> @btime update_slow2!($A);
  3.710 s (0 allocations: 0 bytes)

julia> @btime create_fast1($A);
  275.710 ms (2 allocations: 953.67 MiB)

julia> @btime create_slow1($A);
  2.683 s (2 allocations: 953.67 MiB)

julia> @btime create_fast2($A);
  1.191 s (29 allocations: 1.14 GiB)

julia> @btime create_slow2($A);
  7.202 s (29 allocations: 1.14 GiB)

If a rule to remember is what’s needed: the iterator closest to the expression being evaluated is the one that is iterated first.

IMO, it is not completely unnatural that in

[A[i,j,k] + 1.0 for i in axes(A,1), j in axes(A,2), k in axes(A,3)]

the first thing being iterated is i in axes(A,1),

but I guess one also gets accustomed to that.

Then if you remove the commas you are doing a different thing: chaining nested iterators, which flattens the result into a `Vector` instead of taking the iterator product to build an N-dimensional array. Then I agree that what is fast or not can be strange; there seems to be an arbitrary choice there. Truth be told, I never used (and probably never will use) that syntax.
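
A quick check (not from the original posts) makes both orders visible: the comma form builds a `Matrix` with `i` varying fastest in memory order, while the chained form flattens into a `Vector` with `j` varying fastest.

julia> [(i, j) for i in 1:2, j in 1:3]
2×3 Matrix{Tuple{Int64, Int64}}:
 (1, 1)  (1, 2)  (1, 3)
 (2, 1)  (2, 2)  (2, 3)

julia> [(i, j) for i in 1:2 for j in 1:3]
6-element Vector{Tuple{Int64, Int64}}:
 (1, 1)
 (1, 2)
 (1, 3)
 (2, 1)
 (2, 2)
 (2, 3)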

Yeah, I think the awkward ordering of [(i, j) for i in 1:2 for j in 1:3] is just left over from Python, from which this particular syntax originated.

As a performance comment, there are two primary aspects.
One is SIMD, which we won’t get in these examples since none of them use `@inbounds`.
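
For illustration, here is a sketch of an `@inbounds`/`@simd` variant of `update_fast1!` (the name `update_fast1_simd!` is made up; the actual speedup depends on the hardware):

"Same traversal order as `update_fast1!`; `@inbounds` drops bounds checks so the innermost loop can vectorize."
function update_fast1_simd!(A)
    @inbounds for k in axes(A,3), j in axes(A,2)
        @simd for i in axes(A,1)
            A[i,j,k] = A[i,j,k] - 1.0
        end
    end
end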

The other is memory related.
Memory is fetched one cache-line at a time. Most CPUs today have 64 byte cachelines. The Apple M1 has 128 byte cachelines, and the Fujitsu A64FX has 256 byte cachelines.

Cachelines are contiguous: a 64 byte cacheline holds 8 Float64s.
So if you’re iterating across columns (i.e., along a row of a column-major array), you’re only using 1 of the 8 Float64s in each cacheline fetched from memory. If the arrays are large enough, that cacheline will be evicted before it comes time to use another element from it, meaning you need to fetch it again.
Hence, you can expect roughly an 8x performance difference.
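
As a minimal sketch of this effect (the function names here are made up; the measured ratio depends on cache and prefetcher behavior), compare the two traversal orders on a matrix too large for cache:

using BenchmarkTools

"Stride-1 traversal: all 8 Float64s in each fetched cacheline get used."
function sum_colmajor(A)
    s = 0.0
    @inbounds for j in axes(A,2), i in axes(A,1)
        s += A[i,j]
    end
    return s
end

"Large-stride traversal: only 1 Float64 per fetched cacheline gets used."
function sum_rowmajor(A)
    s = 0.0
    @inbounds for i in axes(A,1), j in axes(A,2)
        s += A[i,j]
    end
    return s
end

# A = rand(10_000, 10_000); @btime sum_colmajor($A); @btime sum_rowmajor($A)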

Another factor is that the prefetchers have an easier time with contiguous memory accesses.
There is also a strided prefetcher, but it requires relatively small strides and won’t prefetch across pages, and if you’re going across columns of large arrays, you’re probably iterating across pages relatively rapidly, too. Do that too much, and you’ll also run out of TLB entries, and need to start computing address translations for the mapping of virtual memory to hardware.

While this is interesting, I don’t see how any of this is different between column-major and row-major ordering?

It is not. I was just making a comment on the consequences of accessing data in the wrong order.

That is a great way to remember it!
The iterator closest to the expression is evaluated first.

I can probably also safely ignore the last use case, since I don’t really understand what it does anyway.

Actually, note that while in:

[A[i,j,k] + 1.0 for i in axes(A,1), j in axes(A,2), k in axes(A,3)]

the i index varies fastest, in

[A[i,j,k] + 1.0 for i in axes(A,1) for j in axes(A,2) for k in axes(A,3)]

the opposite is true, with k varying fastest.

So it’s not always true that the iterator closest to the expression is evaluated first.

That has been my approach so far (not necessarily consciously, since I never used that other syntax).

Yes, that last case breaks the rule (apparently because of Python). Good to highlight the inconsistency, but I don’t think most people need that syntax.

“Inside out” might also be a useful way to remember.

Although that’s maybe not as obvious in the second case.

FYI, one place where row-major is faster is lazy matrix-vector multiplication `getindex` (`ApplyVector` here comes from LazyArrays.jl):

julia> using LazyArrays

julia> n = 20_000; A = randn(n,n); b = randn(n);

julia> @time ApplyVector(*, A, b)[100]
  0.004955 seconds (27 allocations: 2.188 KiB)
180.974127516603

julia> @time ApplyVector(*, A', b)[100]
  0.000492 seconds (31 allocations: 2.250 KiB)
-50.2513375406383

Do you have an explanation for this?

Cache locality (specifically spatial locality, i.e. cache-line utilization). You are essentially computing dot(@view(A[i, :]), b), which is faster if A[i, :] is contiguous in memory (i.e. if A is row-major).
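
The effect can be reproduced without LazyArrays (a sketch mirroring the sizes above; `At` is just the transpose bound to a name): entry `i` of `A*b` is the dot product of row `i` of `A` with `b`, and that row is contiguous in memory only in the transposed layout.

julia> using LinearAlgebra, BenchmarkTools

julia> n = 20_000; A = randn(n, n); b = randn(n); At = A';

julia> @btime dot(@view($A[100, :]), $b);   # row of A: stride-n access, slow

julia> @btime dot(@view($At[100, :]), $b);  # row of A': a column of A, contiguous, fast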

Another interesting example is integration.

In the case of multiple integrals (e.g. triple integrals), we usually think of the “fastest” variable as the last (rightmost) index.

Don’t you mean leftmost? When you integrate f(x, y, z) dx dy dz you start by integrating (“iterating” over) x, then over y, then z.
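
In symbols, the innermost (leftmost) differential is integrated first, just as the innermost loop runs over the first index:

$$\iiint f(x,y,z)\,dx\,dy\,dz = \int\!\left(\int\!\left(\int f(x,y,z)\,dx\right)dy\right)dz$$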

Actually, thank you for bringing this up, I think this is how I’ll finally be able to remember how Julia iterates over an array.

C++23 actually refers to column and row major as “layout_left” and “layout_right”, respectively.

Re “surprising”

A mnemonic that might help with this is to visualize that in both forms, the lexically “inside” variable (e.g. the first variable i in the tail of the generator expression, or the second variable j in the head of the statement form) forms the inner loop and varies fastest.

How is one supposed to tell that i is lexically “inside” in `for i=1:5, j=1:5`, though? Especially since the behavior for the other version is the opposite:

[... for i=1:5, j=1:5]
[... for i=1:5 for j=1:5]

I don’t see how the syntax would allow you to tell the difference here.

Eek, I wasn’t aware of that last form (4 below), which my mnemonic fails for. Color me surprised! Thanks for setting me straight, and sorry to derail an old topic.

for i=1:2, j=3:4; f(i, j) end # (1) i lexically outside, controls outer loop
for i=1:2 for j=3:4; f(i, j) end end # (2) i lexically outside, controls outer loop
[f(i, j) for i=1:2, j=3:4]; # (3) i lexically inside, controls inner loop
[f(i, j) for i=1:2 for j=3:4]; # (4) i lexically inside, controls outer loop