# Access arrays in memory order, along columns

I’m trying to wrap my head around how best to iterate over arrays. I’m coming to Julia after primarily using MATLAB, and since both MATLAB and Julia are column major, arrays should be accessed in the same order in both. My understanding from MATLAB is that the loops below are ordered for optimal speed.

```matlab
a = 1:25*50*75;
a = reshape(a, 25, 50, 75);
for k = 1:size(a, 3)
    for j = 1:size(a, 2)
        for i = 1:size(a, 1)
            a(i, j, k) = i + j + k;
        end
    end
end
```

Should the loops be ordered the same way in Julia?

```julia
a = collect(1:25*50*75)
a = reshape(a, (25, 50, 75))
for k = 1:size(a, 3)
    for j = 1:size(a, 2)
        for i = 1:size(a, 1)
            a[i, j, k] = i + j + k
        end
    end
end
```

Yes.

ps: you can use `for i in axes(a, 1)`, for example, so that the loop is independent of how the array is indexed (indices not starting at 1 are uncommon, but possible).
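For instance, the loop from the question rewritten with `axes` (a sketch; for ordinary 1-based arrays it behaves identically to the `1:size(a, d)` version):

```julia
a = reshape(collect(1:25*50*75), 25, 50, 75)
for k in axes(a, 3), j in axes(a, 2), i in axes(a, 1)
    # axes(a, d) returns the valid index range along dimension d,
    # so this loop is correct even if the indices don't start at 1
    a[i, j, k] = i + j + k
end
```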


You can also use `eachindex` or `CartesianIndices` or `pairs` for this sort of thing, depending on the calculations you want to perform. They will choose the right order for you. You shouldn’t have to think about this too often.

In most cases, do

```julia
for i in eachindex(a)
    a[i] = ...
end
```

If you need the actual indices, like in your example,

```julia
for ind in CartesianIndices(a)
    a[ind] = sum(Tuple(ind))
end
```

These work for arrays of arbitrary dimensions.

There are several indexing functions that make working with arrays much nicer in Julia than in MATLAB.


Besides what others have suggested, in Julia you can write this in a single line, without worrying about column or row order, and still get the best performance. This is possible using packages like `Tullio`:

```julia
using Tullio
a = collect(1:25*50*75)
a = reshape(a, 25, 50, 75)
@tullio a[i, j, k] = i + j + k
```

This Tullio version takes only 26.500 μs on my machine, while the multi-threaded version

```julia
Threads.@threads for k in axes(a, 3)
    for j in axes(a, 2)
        for i in axes(a, 1)
            a[i, j, k] = i + j + k
        end
    end
end
```

takes 2.954 ms. How can Tullio achieve that speed?

Tullio doesn’t do black magic; it should be equivalent to properly ordered loops. Be sure you started Julia with the `-t auto` option. Multithreading should not help much here because this problem is essentially memory-bound rather than CPU-bound. I observe only a 2.5x speedup on a 4-core/8-thread machine.

```julia
using Tullio
using BenchmarkTools

function do_loops!(a)
    for k in axes(a, 3)
        for j in axes(a, 2)
            for i in axes(a, 1)
                a[i, j, k] = i + j + k
            end
        end
    end
end

a = collect(1:25*50*75)
a = reshape(a, (25, 50, 75))
@btime do_loops!($a)
# 24.900 μs (0 allocations: 0 bytes)   - Serial
# 10.600 μs (48 allocations: 5.08 KiB) - Parallel
@btime @tullio $a[i, j, k] = i + j + k
# 25.800 μs (0 allocations: 0 bytes)
```

I see. I forgot to put the for loops in a function.


`@tturbo` from `LoopVectorization` sometimes does magic in situations like these, because it has an allocation-free and fast threading mechanism (which can be found in `Polyester`).

`@inbounds` may also help, although using `axes` already guarantees in-bounds access, so it may not make any difference here.
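A sketch of what the `@tturbo` version could look like, assuming `LoopVectorization` is installed (the function name is just for illustration):

```julia
using LoopVectorization

function do_loops_turbo!(a)
    # @tturbo vectorizes and multithreads the loop nest, using
    # Polyester's lightweight, allocation-free threading
    @tturbo for k in axes(a, 3)
        for j in axes(a, 2)
            for i in axes(a, 1)
                a[i, j, k] = i + j + k
            end
        end
    end
    return a
end
```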


I would also suggest that, since you aren’t using the original values stored in `a`, you change the above MATLAB-ish code to a more Julian form:

```julia
a = zeros(Int, 25, 50, 75)
```

or

```julia
a = Array{Int, 3}(undef, 25, 50, 75)
```

The first form is more convenient to write and initializes `a` to all zeros. The second form may be a bit faster and leaves the contents of `a` unspecified.


> Multithreading should not do much here

The strange thing is that when I run your code on my computer, I don’t get any improvement from the parallel version with `Threads.@threads` (in fact, it is sometimes worse), but I do get a significant improvement from `Distributed`’s `@distributed`. On my 8-core, 16-thread Windows laptop, Tullio is slowest, with the time for `@distributed` < `@tturbo` < `@tullio`, each step differing by about a factor of 10.


If you need `i+1`, `j-1`, etc. on the right-hand side, is this the suggested method?

```julia
for ind in CartesianIndices(a)
    (i, j, k) = Tuple(ind)
    a[ind] = (a[i+1, j, k] + a[i, j-1, k]) / 2
end
```

I don’t see `pairs` discussed in the array indexing documentation or blog post. I interpret the `pairs` docstring to mean it should be used if you need indices and values at the same time and that `pairs` is always preferred over `enumerate`. Is that correct?

This is going to fail, since `i+1` and `j-1` must necessarily go out of bounds at the edges. You’ll have to figure out some way to peel off the edge values.
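One way to peel off the edges is to restrict the iteration to the interior indices where the stencil stays in bounds. A sketch for this particular stencil:

```julia
# a[i+1, j, k] needs i ≤ size(a, 1) - 1; a[i, j-1, k] needs j ≥ 2
interior = CartesianIndices((1:size(a, 1) - 1, 2:size(a, 2), axes(a, 3)))
for ind in interior
    (i, j, k) = Tuple(ind)
    a[ind] = (a[i+1, j, k] + a[i, j-1, k]) / 2
end
```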

`pairs` will give you the value and the index/indices. `enumerate` will just count from one, so it depends on what you need. If you are looking for indices, use `pairs`.
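A small illustration of the difference:

```julia
b = reshape(collect(1:4), 2, 2)
for (ind, val) in pairs(b)
    # ind is a CartesianIndex(i, j) that can index b directly: b[ind] == val
end
for (n, val) in enumerate(b)
    # n simply counts 1, 2, 3, 4, regardless of b's shape or index offsets
end
```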


`eachindex` and `CartesianIndices` are great, but often I have some computations that are unique only to one or two dimensions. And, so as not to repeat the computations on each iteration of the remaining dimensions, I need to explicitly loop over the array myself.

I am new to `Distributed`, so could you give example code solving the problem in the OP with it? Thanks.