Why column major?

Sure, I was thinking of plain numeric arrays, tables would be different.

7 posts were split to a new topic: Row vs. column major for arrays of coordinate vectors

`A = rand(2,2)'` is a row-major matrix in natural Julia syntax.

Let me add my 2¢ too. I confess the issue and the arguments initially fried my brain; it is indeed quite interesting. But this is how I view it now.

With the notational convention A[i,j] we tend to call i the row index and j the column index. Since Julia is column-major, it seems weird (or at least inconsistent or inconvenient) to some that the first index corresponds to a row, while only the second index corresponds to a column. The perceived weirdness of this convention is magnified when another dimension is added and we have A[i,j,k]. This may now look inconsistent because it is asymmetric. On one hand we say that the prioritized (fast) dimension is columns, then rows, and then… how shall we call it… say, floors (with reference to a virtual building where we put one matrix on top of another). But on the other hand we order the indices as rows, columns, floors, as in A[i,j,k]. Weird.

Or not? In fact, I no longer think so. I find it quite consistent. But I think it helps to view i as an index with which we move along the first dimension (that is, along a given column) rather than just the index of a row. Similarly, j is used to move along a given row. And k is used to move… well, vertically (again, with reference to a virtual building). The perceived asymmetry then disappears, doesn’t it? It is not rows → columns → floors but rather indexing/moving along the given column → along the row → along… well… simply moving vertically. It seems to me perfectly consistent with the column-major format.

Perhaps my visible trouble finding a descriptive word for the third direction partially explains where the confusion comes from. k is the index of a floor, and by increasing k we are moving along… some “vertical”. Similarly, i is the index of a row, but by increasing it we are moving along a column.
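To make the “moving along the first dimension” picture concrete, here is a minimal sketch (not from the post above; the values are chosen so each element equals its memory position):

```julia
# Column-major layout: this fill order makes each element's value equal
# its linear (memory) position.
A = reshape(collect(1:8), 2, 2, 2)

A[1,1,1]  # == 1
A[2,1,1]  # == 2 — incrementing i steps by 1 in memory (down a column)
A[1,2,1]  # == 3 — incrementing j jumps by 2 (one column length)
A[1,1,2]  # == 5 — incrementing k jumps by 4 (one whole 2×2 "floor")

strides(A)  # (1, 2, 4): memory step per unit increase in i, j, k
```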

Okay, I’m confused now. What is the fast way to access/create A[i,j,k] in Julia …

  1. with nested for loops on different lines
  2. with nested for loops on one line
  3. with an array comprehension

… and how do you reason through their order?

I understand what column major storage means, but not necessarily the syntax for accessing it.
(Btw, I looked for this information in the Performance and Multi-dimensional Arrays sections of the docs but couldn’t find it.)

This may help dispel the confusion.

Everyone remembers this however they like, but one way is to remember: run first over the first index:

julia> function update_fast!(A)
          for k in axes(A,3) 
             for j in axes(A,2) 
                 for i in axes(A,1) # <--- running first
                    A[i,j,k] = A[i,j,k] - 1.0
                 end
             end
          end
       end
update_fast! (generic function with 1 method)

julia> function update_slow!(A)
          for i in axes(A,1) 
             for j in axes(A,2) 
                 for k in axes(A,3) # <--- running first
                    A[i,j,k] = A[i,j,k] - 1.0
                 end
             end
          end
       end
update_slow! (generic function with 1 method)

julia> A = rand(50,50,50);

julia> using BenchmarkTools

julia> @btime update_fast!($A);
  22.939 μs (0 allocations: 0 bytes)

julia> @btime update_slow!($A);
  138.349 μs (0 allocations: 0 bytes)

These two don’t generate different code; they only differ in style.

Thanks @lmiq. I’ll use your framework to attempt to answer my own question, though I still don’t understand some of the weirdness illustrated by the OP.

using BenchmarkTools

"""
Innermost loop is performed first.
Remember "Run first over first index."
Higher dimensions should appear last in indexing and as the topmost `for` loop.
"""
function update_fast1!(A)
    for k in axes(A,3)
        for j in axes(A,2)
            for i in axes(A,1)
                A[i,j,k] = A[i,j,k] - 1.0
            end
        end
    end
end

"Reverse iteration is slower."
function update_slow1!(A)
    for i in axes(A,1)
        for j in axes(A,2)
            for k in axes(A,3)
                A[i,j,k] = A[i,j,k] - 1.0
            end
        end
    end
end

"""
One-line nested for loops unroll as though you are reading them from top to bottom.
The last loop listed is performed first.
"""
function update_fast2!(A)
    for k in axes(A,3), j in axes(A,2), i in axes(A,1)
        A[i,j,k] = A[i,j,k] - 1.0
    end
end

"Reverse iteration is slower."
function update_slow2!(A)
    for i in axes(A,1), j in axes(A,2), k in axes(A,3)
        A[i,j,k] = A[i,j,k] - 1.0
    end
end

"""
For some reason, inside an array comprehension, the reversed order is faster.
The first loop listed is performed first.
"""
function create_fast1(A)
    B = [A[i,j,k] + 1.0 for i in axes(A,1), j in axes(A,2), k in axes(A,3)]
end

"The order from `update_fast2!` is slow in this case."
function create_slow1(A)
    B = [A[i,j,k] + 1.0 for k in axes(A,3), j in axes(A,2), i in axes(A,1)]
end

"Here the original `update_fast2!` order is fast again, but the result is a `Vector` not a 3D `Array`."
function create_fast2(A)
    B = [A[i,j,k] + 1.0 for k in axes(A,3) for j in axes(A,2) for i in axes(A,1)]
end

"Reverse order is slower. Again result is a `Vector`."
function create_slow2(A)
    B = [A[i,j,k] + 1.0 for i in axes(A,1) for j in axes(A,2) for k in axes(A,3)]
end

julia> A = rand(500, 500, 500);

julia> @btime update_fast1!($A);
  68.297 ms (0 allocations: 0 bytes)

julia> @btime update_slow1!($A);
  3.781 s (0 allocations: 0 bytes)

julia> @btime update_fast2!($A);
  67.938 ms (0 allocations: 0 bytes)

julia> @btime update_slow2!($A);
  3.710 s (0 allocations: 0 bytes)

julia> @btime create_fast1($A);
  275.710 ms (2 allocations: 953.67 MiB)

julia> @btime create_slow1($A);
  2.683 s (2 allocations: 953.67 MiB)

julia> @btime create_fast2($A);
  1.191 s (29 allocations: 1.14 GiB)

julia> @btime create_slow2($A);
  7.202 s (29 allocations: 1.14 GiB)
If rules to remember are what’s needed, remember this one: the iterator closest to the expression being evaluated is the one iterated first.

IMO, it is not completely unnatural that in

[A[i,j,k] + 1.0 for i in axes(A,1), j in axes(A,2), k in axes(A,3)]

the first thing being iterated is i in axes(A,1),

but I guess one also gets accustomed to that.

Then if you remove the commas and chain the `for`s you are doing a different thing, which is chaining the iterators to create a flat vector. There I agree that what is fast or not can be strange; it seems there is an arbitrary choice there. Truth be told, I never used (and probably never will use) that syntax.
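A minimal sketch of the two syntaxes side by side (toy ranges, not from the posts above; the ordering is the point): commas build a product with the shape of the ranges and the first generator fastest in memory, while chained `for`s flatten to a `Vector` with the last generator fastest:

```julia
# Comma-separated generators: an N-d comprehension (here 2×3),
# stored column-major, so i (the first generator) varies fastest in memory.
prod2d = [(i, j) for i in 1:2, j in 1:3]

# Chained `for`s: a flattened Vector; the last generator (j) varies fastest.
flat = [(i, j) for i in 1:2 for j in 1:3]

vec(prod2d)  # (1,1) (2,1) (1,2) (2,2) (1,3) (2,3) — i fastest
flat         # (1,1) (1,2) (1,3) (2,1) (2,2) (2,3) — j fastest
```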

Yeah, I think the awkward ordering of [(i, j) for i in 1:2 for j in 1:3] is just left over from Python, from which this particular syntax originated.

As a performance comment, there are two primary aspects.
One is SIMD, which we won’t get in the examples that don’t have @inbounds.

The other is memory-related.
Memory is fetched one cache-line at a time. Most CPUs today have 64 byte cachelines. The Apple M1 has 128 byte cachelines, and the Fujitsu A64FX has 256 byte cachelines.

Cachelines are contiguous. A 64 byte cacheline contains 8 Float64.
So if you’re iterating across columns, you’re only using 1/8 of the Float64s fetched from memory. If the arrays are large enough, that cacheline will be evicted before it comes time to use another element from it, meaning you need to fetch it again.
Hence, you can expect an 8x performance difference.

Another factor is that the prefetchers have an easier time with contiguous memory accesses.
There is also a strided prefetcher, but it requires relatively small strides and won’t prefetch across pages, and if you’re going across columns of large arrays, you’re probably iterating across pages relatively rapidly, too. Do that too much, and you’ll also run out of TLB entries, and need to start computing address translations for the mapping of virtual memory to hardware.
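The 1-in-8 effect above can be sketched directly (a toy example, not from the original post): summing every 8th `Float64` of a large array touches just as many 64-byte cache lines as summing all of them:

```julia
# Toy sketch of the cache-line argument: with 64-byte lines holding
# 8 Float64s, a stride-8 pass over a large array fetches the same cache
# lines as a stride-1 pass but uses only one element from each.
function sum_strided(x, stp)
    s = 0.0
    @inbounds for i in 1:stp:length(x)
        s += x[i]
    end
    return s
end

x = rand(8 * 10^6)
@time sum_strided(x, 1)  # contiguous: uses every element of each line
@time sum_strided(x, 8)  # strided: one element per cache line fetched
```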

While this is interesting, I don’t see how any of this is different for column-major vs. row-major ordering?

It is not. I was just making a comment on the consequences of accessing data in the wrong order.

That is a great way to remember it!
Iterator closest to the expression is evaluated first.

I can probably also safely ignore the last use case, since I don’t really understand what it does anyway.

Actually, note that while in:

[A[i,j,k] + 1.0 for i in axes(A,1), j in axes(A,2), k in axes(A,3)]

the i index moves faster, in

[A[i,j,k] + 1.0 for i in axes(A,1) for j in axes(A,2) for k in axes(A,3)]

the opposite is true, with k moving faster.

So it’s not always true that the iterator closest to the expression is evaluated first.

That has been my approach so far (not necessarily consciously, since I never used that other syntax).

Yes, that last case breaks the rule (apparently because of Python). Good to highlight the inconsistency, but I don’t think most people need that syntax.

“Inside out” might also be a useful way to remember.

Although that’s maybe not as obvious in the second case:

FYI, one place where row-major is faster is lazy matrix-vector multiplication `getindex` (`ApplyVector` here comes from LazyArrays.jl):

julia> n = 20_000; A = randn(n,n); b = randn(n);

julia> @time ApplyVector(*, A, b)[100]
  0.004955 seconds (27 allocations: 2.188 KiB)
180.974127516603

julia> @time ApplyVector(*, A', b)[100]
  0.000492 seconds (31 allocations: 2.250 KiB)
-50.2513375406383
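A sketch of why the adjoint version wins (assuming the lazy `getindex` at position k reduces to a dot product of row k with `b`): for column-major `A` that row is strided, while a row of `A'` is a contiguous column of `A`:

```julia
# Sketch: reading entry k of a lazy A*b needs row k of A.
# In column-major storage that row is strided (stride n), while a row
# of A' is a contiguous column of A — hence the A' version is faster.
n = 1_000
A = randn(n, n)

stride(@view(A[100, :]), 1)  # n — one cache line touched per element
stride(@view(A[:, 100]), 1)  # 1 — contiguous in memory
```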