Thanks @lmiq. I’ll use your framework to attempt to answer my own question, though I still don’t understand some of the weirdness illustrated by OP.
using BenchmarkTools
"""
Julia arrays are column-major, so the first index is contiguous in memory.
Remember "run fastest over the first index": the innermost loop should be over `i`.
Higher dimensions should appear last in the indexing and as the outermost `for` loop.
"""
function update_fast1!(A)
    for k in axes(A,3)
        for j in axes(A,2)
            for i in axes(A,1)
                A[i,j,k] = A[i,j,k] - 1.0
            end
        end
    end
end
"Reversed nesting (first index outermost) jumps through memory with a large stride and is slower."
function update_slow1!(A)
    for i in axes(A,1)
        for j in axes(A,2)
            for k in axes(A,3)
                A[i,j,k] = A[i,j,k] - 1.0
            end
        end
    end
end
"""
One-line nested `for` loops nest in the order they are read, left to right:
the first range listed is the outermost loop and the last range listed is the
innermost loop, which varies fastest.
"""
function update_fast2!(A)
    for k in axes(A,3), j in axes(A,2), i in axes(A,1)
        A[i,j,k] = A[i,j,k] - 1.0
    end
end
"Reversed order (`i` outermost, `k` innermost) is slower."
function update_slow2!(A)
    for i in axes(A,1), j in axes(A,2), k in axes(A,3)
        A[i,j,k] = A[i,j,k] - 1.0
    end
end
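"""
In a multidimensional array comprehension (ranges separated by commas) the rule is reversed:
the first range listed varies fastest and becomes the first dimension of the result,
so listing `i` first gives contiguous reads of `A` and contiguous writes to the result.
"""
function create_fast1(A)
    return [A[i,j,k] + 1.0 for i in axes(A,1), j in axes(A,2), k in axes(A,3)]
end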
"""
The order from `update_fast2!` is slow in this case.
The result is also permuted: `B[k,j,i] == A[i,j,k] + 1.0`.
"""
function create_slow1(A)
    return [A[i,j,k] + 1.0 for k in axes(A,3), j in axes(A,2), i in axes(A,1)]
end
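"""
With chained `for` clauses the original `update_fast2!` order is fast again (the last clause is innermost),
but the result is a `Vector`, not a 3D `Array`. Its length is not known up front, so it grows as it is filled.
"""
function create_fast2(A)
    return [A[i,j,k] + 1.0 for k in axes(A,3) for j in axes(A,2) for i in axes(A,1)]
end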
"Reversed order is slower. Again the result is a `Vector`."
function create_slow2(A)
    return [A[i,j,k] + 1.0 for i in axes(A,1) for j in axes(A,2) for k in axes(A,3)]
end
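The ordering rules described in the docstrings can be checked on a tiny array, where the traversal order is easy to read off. This is only a minimal sketch using Base Julia; the helper name `loop_order` is made up for illustration and is not part of the original code.

```julia
# Hypothetical helper (not from the original post): record the order in which
# a one-line nested loop visits (i, j).
function loop_order()
    visited = Tuple{Int,Int}[]
    for j in 1:3, i in 1:2      # the last range listed (i) is the innermost loop
        push!(visited, (i, j))
    end
    return visited
end

# Multidimensional comprehension: the FIRST range listed varies fastest and
# becomes the first dimension of the result.
M = [(i, j) for i in 1:2, j in 1:3]      # 2×3 Matrix with M[i, j] == (i, j)

# Chained `for` clauses nest like written loops: the LAST clause is innermost,
# and the result is a flat Vector.
V = [(i, j) for j in 1:3 for i in 1:2]

# All three walk the elements in the same column-major order.
loop_order() == vec(M) == V              # true

# Column-major layout: stepping the first index moves one element in memory.
strides(zeros(2, 3, 4))                  # (1, 2, 6)
```

So the comprehension is not really an exception: in every fast variant, `i` is the index that changes fastest. Only the syntax for saying so differs.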
julia> A = rand(500, 500, 500);
julia> @btime update_fast1!($A);
68.297 ms (0 allocations: 0 bytes)
julia> @btime update_slow1!($A);
3.781 s (0 allocations: 0 bytes)
julia> @btime update_fast2!($A);
67.938 ms (0 allocations: 0 bytes)
julia> @btime update_slow2!($A);
3.710 s (0 allocations: 0 bytes)
julia> @btime create_fast1($A);
275.710 ms (2 allocations: 953.67 MiB)
julia> @btime create_slow1($A);
2.683 s (2 allocations: 953.67 MiB)
julia> @btime create_fast2($A);
1.191 s (29 allocations: 1.14 GiB)
julia> @btime create_slow2($A);
7.202 s (29 allocations: 1.14 GiB)
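When the loop body does not depend on a particular nesting, the ordering question can be sidestepped altogether. The following is a sketch using Base Julia only. The function names (`update_eachindex!`, `update_cartesian!`, `create_broadcast`, `create_reshaped`) are hypothetical, and these variants were not benchmarked here, so I make no claims about their timings.

```julia
# Hypothetical names; these variants are not part of the benchmarks above.

# `eachindex` picks an iteration order that is efficient for the array's layout.
function update_eachindex!(A)
    for I in eachindex(A)
        A[I] -= 1.0
    end
    return A
end

# `CartesianIndices` also iterates with the first index varying fastest,
# while still exposing (i, j, k) if the body needs them.
function update_cartesian!(A)
    for I in CartesianIndices(A)
        i, j, k = Tuple(I)      # assumes a 3D array
        A[i, j, k] -= 1.0
    end
    return A
end

# Broadcasting allocates a result with the same shape as `A`.
create_broadcast(A) = A .+ 1.0

# The flat Vector produced by chained `for` clauses can be reshaped back
# to 3D; for an ordinary `Array` the reshaped result shares the same data.
create_reshaped(A) = reshape(
    [A[i, j, k] + 1.0 for k in axes(A, 3) for j in axes(A, 2) for i in axes(A, 1)],
    size(A),
)
```

For example, `create_reshaped(A) == create_broadcast(A)` should hold for any 3D `Array` of floats, since both compute `A[i,j,k] + 1.0` at every position.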