Sure, I was thinking of plain numeric arrays, tables would be different.

`A = rand(2,2)'`

is a row major matrix in natural Julia syntax.

Let me add my my 2¢ too. I confess the issue and the arguments initially made my brain fry, it is indeed quite interesting. But this is how I view it now.

With the notational convention `A[i,j]`

we tend to call `i`

the *row index* and `j`

the *column index*. Taking into consideration that Julia is *column-major*, it seems weird (or at least inconsistent or inconvenient) to some that the first index corresponds to a row, while it is only the second index that corresponds to a column. Perceived weirdness of this convention is even magnified when another dimension is added and we have `A[i,j,k]`

. This may now look inconsistent because asymmetric. On one hand we say that the prioritized (fast) dimension is columns, then rows and then… how shall we call it… say, floors (with a reference to a virtual building where we put one matrix on top of another). But on the other hand we order the indices as rows, columns, floor, as in `A[i,j,k]`

. Weird.

Or not? In fact, I no longer think so. I find it quite consistent. But I think it helps to view the ** i as an index with which we move along the first dimension** (that is, along a given column) rather than just the index of a row. Similarly

`j`

is used to move along a given row. And `k`

is used to move… welll, vertically (again, with the reference to a virtual building). The perceived asymetry then disappears, doesn’t it? It is not *rows → columns → floors*but rather

*indexing/moving along the given column → along the row → along… well… simply moving vertically*. It seems to me perfectly consistent with column-major format.

Perhaps my displayed troubles to find a descriptive word for the third direction can partially explain where the confusion comes from. `k`

is an index of a floor and by increasing `k`

we are moving along… along some “vertical”. Similarly `i`

is an index of a row but by increasing it we are moving along a column.

Okay, I’m confused now. What is the fast way to access/create `A[i,j,k]`

in Julia …

- with nested
`for`

loops on different lines - with nested
`for`

loops on one line - with an array comprehension

… and how do you reason through their order?

I understand what column major storage means, but not necessarily the syntax for accessing it.

(Btw, I looked for this information in the Performance and Multi-dimensional Arrays sections of the docs but couldn’t find it.)

this may help dispel the confusion

Each one remembers this as he/she wants. But one way it to remember that: run first over the first index:

```
julia> function update_fast!(A)
for k in axes(A,3)
for j in axes(A,2)
for i in axes(A,1) # <--- running first
A[i,j,k] = A[i,j,k] - 1.0
end
end
end
end
update_fast! (generic function with 1 method)
julia> function update_slow!(A)
for i in axes(A,1)
for j in axes(A,2)
for k in axes(A,3) # <--- running first
A[i,j,k] = A[i,j,k] - 1.0
end
end
end
end
update_slow! (generic function with 1 method)
julia> A = rand(50,50,50);
julia> using BenchmarkTools
julia> @btime update_fast!($A);
22.939 μs (0 allocations: 0 bytes)
julia> @btime update_slow!($A);
138.349 μs (0 allocations: 0 bytes)
```

these two doesn’t generate different code, just different in style.

Thanks @lmiq. I’ll use your framework to attempt to answer my own question, though I still don’t understand some of the weirdness illustrated by OP.

```
using BenchmarkTools
"""
Innermost loop is performed first.
Remember "Run first over first index."
Higher dimensions should appear last in indexing and as the topmost `for` loop.
"""
function update_fast1!(A)
for k in axes(A,3)
for j in axes(A,2)
for i in axes(A,1)
A[i,j,k] = A[i,j,k] - 1.0
end
end
end
end
"Reverse iteration is slower."
function update_slow1!(A)
for i in axes(A,1)
for j in axes(A,2)
for k in axes(A,3)
A[i,j,k] = A[i,j,k] - 1.0
end
end
end
end
"""
One-line nested for loops unroll as though you are reading them from top to bottom.
The last loop listed is performed first.
"""
function update_fast2!(A)
for k in axes(A,3), j in axes(A,2), i in axes(A,1)
A[i,j,k] = A[i,j,k] - 1.0
end
end
"Reverse iteration is slower."
function update_slow2!(A)
for i in axes(A,1), j in axes(A,2), k in axes(A,3)
A[i,j,k] = A[i,j,k] - 1.0
end
end
"""
For some reason, inside an array comprehension, the reversed order is faster.
The first loop listed is performed first.
"""
function create_fast1(A)
B = [A[i,j,k] + 1.0 for i in axes(A,1), j in axes(A,2), k in axes(A,3)]
end
"The order from `update_fast2!` is slow in this case."
function create_slow1(A)
B = [A[i,j,k] + 1.0 for k in axes(A,3), j in axes(A,2), i in axes(A,1)]
end
"Here the original `update_fast2!` order is fast again, but the result is a `Vector` not a 3D `Array`."
function create_fast2(A)
B = [A[i,j,k] + 1.0 for k in axes(A,3) for j in axes(A,2) for i in axes(A,1)]
end
"Reverse order is slower. Again result is a `Vector`."
function create_slow2(A)
B = [A[i,j,k] + 1.0 for i in axes(A,1) for j in axes(A,2) for k in axes(A,3)]
end
```

```
julia> A = rand(500, 500, 500);
julia> @btime update_fast1!($A);
68.297 ms (0 allocations: 0 bytes)
julia> @btime update_slow1!($A);
3.781 s (0 allocations: 0 bytes)
julia> @btime update_fast2!($A);
67.938 ms (0 allocations: 0 bytes)
julia> @btime update_slow2!($A);
3.710 s (0 allocations: 0 bytes)
julia> @btime create_fast1($A);
275.710 ms (2 allocations: 953.67 MiB)
julia> @btime create_slow1($A);
2.683 s (2 allocations: 953.67 MiB)
julia> @btime create_fast2($A);
1.191 s (29 allocations: 1.14 GiB)
julia> @btime create_slow2($A);
7.202 s (29 allocations: 1.14 GiB)
```

If rules to remember is what is needed, remember that what is being iterated first is what is closer to the expression being evaluated.

IMO, it is not completely unnatural that in

```
[A[i,j,k] + 1.0 for i in axes(A,1), for j in axes(A,2), for k in axes(A,3)]
```

the first thing being iterated is `for i in axes(A,1)`

.

but I guess one also gets accustomed by that.

Then if you remove the commas you are doing a different thing, which is building an iterator product to create a vector. Then I agree that what is fast or not can be strange, seems to be that there is an arbitrary choice there. To be truth I never used (and probably never will use) that syntax.

Yeah, I think the awkward ordering of `[(i, j) for i in 1:2 for j in 1:3]`

is just left over from Python, from which this particular syntax originated.

As a performance comment, there are two primary aspects.

One is SIMD, which we won’t get in the examples that don’t have `@inbounds`

.

The others are memory related.

Memory is fetched one cache-line at a time. Most CPUs today have 64 byte cachelines. The Apple M1 has 128 byte cachelines, and the Fujitsu A64FX has 256 byte cachelines.

Cachelines are contiguous. A 64 byte cacheline contains `8`

`Float64`

.

So if you’re iterating across columns, you’re only using 1/8 of the `Float64`

s fetched from memory. If the arrays are large enough, that cacheline will be evicted before it comes time to use another element from it, meaning you need to fetch it again.

Hence, you can expect an 8x performance difference.

Another factor is that the prefetchers have an easier time with contiguous memory accesses.

There is also a strided prefetcher, but it requires relatively small strides and won’t prefetch across pages, and if you’re going across columns of large arrays, you’re probably iterating across pages relatively rapidly, too. Do that too much, and you’ll also run out of TLB entries, and need to start computing address translations for the mapping of virtual memory to hardware.

While this is interesting I don’t see how any of this is different if one has column-major vs. row-major ordering?

It is not. I was just making a comment on the consequences of accessing data in the wrong order.

That is a great way to remember it!

*Iterator closest to the expression is evaluated first.*

I can probably also safely ignore the last use case, since I don’t really understand what it does anyway.

Actually, note that while in:

```
[A[i,j,k] + 1.0 for i in axes(A,1), j in axes(A,2), k in axes(A,3)]
```

the `i`

index moves faster, in

```
[A[i,j,k] + 1.0 for i in axes(A,1) for j in axes(A,2) for k in axes(A,3)]
```

the opposite is true, with `k`

moving faster.

So it’s not always true that the iterator closest to the expression is evaluated first.

That has been my approach so far (not necessarily consciously, since I never used that other syntax).

Yes, that last case breaks the rule (apparently because of Python). Good to highlight the inconsistency, but I don’t think most people need that syntax.

“*Inside out*” might also be a useful way to remember.

Although that’s maybe not as obvious in the second case:

FYI One place where row major is faster is lazy matrix-vector multiplication `getindex`

:

```
julia> n = 20_000; A = randn(n,n); b = randn(n);
julia> @time ApplyVector(*, A, b)[100]
0.004955 seconds (27 allocations: 2.188 KiB)
180.974127516603
julia> @time ApplyVector(*, A', b)[100]
0.000492 seconds (31 allocations: 2.250 KiB)
-50.2513375406383
```