Access arrays in memory order, along columns

I’m trying to wrap my head around how I should best iterate over arrays. I’m coming to Julia after primarily using MATLAB, and since both MATLAB and Julia are column major, arrays should be accessed in the same order. My understanding from MATLAB is that the loops below are ordered for optimal speed.

a = 1:25*50*75;
a = reshape(a, 25, 50, 75);
for k = 1:size(a, 3)
    for j = 1:size(a, 2)
        for i = 1:size(a, 1)
            a(i, j, k) = i + j + k;
        end
    end
end

Should the loops be ordered the same way in Julia?

a = collect(1:25*50*75)
a = reshape(a, (25, 50, 75))
for k = 1:size(a, 3)
    for j = 1:size(a, 2)
        for i = 1:size(a, 1)
            a[i, j, k] = i + j + k
        end
    end
end

Yes.

ps: you can use for i in axes(a, 1), for example, so that the loop is independent of how the array is indexed (for example, not starting at 1 - not common, but possible).
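
For example, the loop above could be written like this (same computation, just with axes instead of 1:size):

for k in axes(a, 3), j in axes(a, 2), i in axes(a, 1)
    # axes(a, d) gives the valid index range along dimension d,
    # so this also works for arrays whose indices don't start at 1
    a[i, j, k] = i + j + k
end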

4 Likes

You can also use eachindex or CartesianIndices or pairs for this sort of thing, depending on the calculations you want to perform. They will choose the right order for you. You shouldn’t have to think about this too often.

In most cases, do

for i in eachindex(a)
    a[i] = ...
end

If you need the actual indices, like in your example,

for ind in CartesianIndices(a)
    a[ind] = sum(Tuple(ind))
end

These work for arrays of arbitrary dimensions.
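
Since pairs was mentioned but not shown, here is a minimal sketch (assuming you want the stored value as well as the index):

# pairs iterates index => value; for a multidimensional Array the index
# is a CartesianIndex, so Tuple(ind) recovers (i, j, k)
for (ind, val) in pairs(a)
    a[ind] = val + sum(Tuple(ind))
end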

There are several indexing functions that make working with arrays much nicer than in Matlab.

6 Likes

Besides what others have suggested, in Julia you can write this in a single line, without worrying about column or row order, and get the best performance. This is possible using packages like Tullio:

using Tullio
a = collect(1:25*50*75)
a = reshape(a, 25, 50, 75)
@tullio a[i, j, k] = i + j + k
2 Likes

This Tullio version takes only 26.500 μs on my machine, while the multi-threaded version

Threads.@threads for k in axes(a, 3)
    for j in axes(a, 2)
        for i in axes(a, 1)
            a[i, j, k] = i + j + k
        end
    end
end

takes 2.954 ms. How can Tullio achieve that speed?

Tullio doesn’t do black magic; it should be equivalent to properly ordered loops. Be sure you started Julia with the -t auto option. Multithreading should not do much here because this problem is essentially memory-bound rather than CPU-bound. I observe only a 2.5x speedup on a 4-core/8-thread machine.
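
As a quick check that threads are actually available (not Tullio-specific):

# Start Julia with `julia -t auto` (or set the JULIA_NUM_THREADS
# environment variable), then check how many threads the session has:
Threads.nthreads()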

using Tullio
using BenchmarkTools   # provides @btime

function do_loops!(a) 
    Threads.@threads for k in axes(a, 3)
        for j in axes(a, 2)
            for i in axes(a, 1)
                a[i, j, k] = i + j + k
            end
        end
    end
end

a = collect(1:25*50*75)
a = reshape(a, (25, 50, 75))
@btime do_loops!($a) 
  # 24.900 μs (0 allocations: 0 bytes)   - Serial
  # 10.600 μs (48 allocations: 5.08 KiB) - Parallel
@btime @tullio $a[i, j, k] = i + j + k
  # 25.800 μs (0 allocations: 0 bytes)
2 Likes

I see. I forgot to put the for loops in a function.

1 Like

@tturbo from LoopVectorization sometimes does magic in situations like these, because it has an allocation-free and fast threading mechanism (which can be found in Polyester).
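
A minimal sketch of how @tturbo could be applied to the same loop (assuming LoopVectorization is installed):

using LoopVectorization   # provides @tturbo, the threaded variant of @turbo

function do_loops_turbo!(a)
    @tturbo for k in axes(a, 3), j in axes(a, 2), i in axes(a, 1)
        a[i, j, k] = i + j + k
    end
    return a
end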

@inbounds may also help, although the use of axes guarantees that already, and thus it may not make any difference here.

2 Likes

I would also suggest that since you aren’t using the original values stored in a, you change the above Matlabish code to a more Julian form:

a = zeros(Int, 25, 50, 75)

or

a = Array{Int, 3}(undef, 25, 50, 75)

The first form is more convenient to write and initializes a to all zeros. The second form may be a bit faster and leaves the contents of a unspecified.

1 Like

Multithreading should not do much here

The strange thing is that when I run your code on my computer, I don’t get any improvement from the parallel version with Threads.@threads - in fact, it is sometimes worse - but I do get a significant improvement using @distributed from Distributed. On my 8-core, 16-thread Windows laptop, Tullio is slowest, with times ordered @distributed < @tturbo < @tullio, each step differing by about a factor of 10.
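
The @distributed code isn’t shown above; one way such a loop might look (an assumption on my part, using SharedArrays so that the worker processes can write into the same array) is:

using Distributed
addprocs(4)                        # number of workers is just an assumption
@everywhere using SharedArrays

a = SharedArray{Int}((25, 50, 75)) # backed by shared memory across workers
@sync @distributed for k in 1:size(a, 3)
    for j in axes(a, 2), i in axes(a, 1)
        a[i, j, k] = i + j + k
    end
end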

1 Like

If you need i+1, j-1, etc. on the right-hand side, is this the suggested method?

for ind in CartesianIndices(a)
    (i, j, k) = Tuple(ind)
    a[ind] = (a[i+1, j, k] + a[i, j-1, k])/2
end

I don’t see pairs discussed in the array indexing documentation or blog post. I interpret the pairs docstring to mean that it should be used when you need indices and values at the same time, and that pairs is always preferred over enumerate. Is that correct?

This is going to fail, since indices like i+1 and j-1 will necessarily go out of bounds at the edges. You’ll have to figure out some way to peel off the edge values.
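
One way to keep the stencil in bounds is to iterate only over the interior, a sketch of which is below (it shrinks every dimension by one, which is more than strictly needed for i+1 and j-1 alone):

a = rand(25, 50, 75)                   # Float64 here, so the /2 below stays exact
inds = CartesianIndices(a)
Ifirst, Ilast = first(inds), last(inds)
Δ = oneunit(Ifirst)                    # CartesianIndex(1, 1, 1)
for ind in (Ifirst + Δ):(Ilast - Δ)    # skip the outermost layer of indices
    i, j, k = Tuple(ind)
    a[ind] = (a[i + 1, j, k] + a[i, j - 1, k]) / 2
end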

pairs will give you the value and the index/indices, while enumerate will just count from 1, so it depends on what you need. If you are looking for indices, use pairs.

2 Likes

eachindex and CartesianIndices are great, but often I have some computations that are specific to only one or two dimensions. So as not to repeat those computations on every iteration of the remaining dimensions, I need to write the nested loops explicitly myself.
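
For instance (a sketch, with j + k standing in for a more expensive per-(j, k) computation):

for k in axes(a, 3), j in axes(a, 2)
    t = j + k                 # hoisted: computed once per (j, k) column
    for i in axes(a, 1)
        a[i, j, k] = i + t    # the inner loop only does the per-element work
    end
end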

I am new to Distributed, so could you give some example code for the problem in the OP? Thanks.