# Array indexing/slicing is slow

I am doing very simple calculations in loops. It looks like the array extraction is the most expensive part of this. Is there an obvious way to speed it up? Here is a MWE of what I am doing. Any help would be greatly appreciated!

`Is1` and `Is2` are matrices of indices, and `Pis1` and `Pis2` are matrices of probabilities. I compute the following

``````Threads.@threads for iA = 1:Nstates
@simd for iy = 1:nY
@simd for iP = 1:nP
@simd for ip = 1:np
# relevant continuation value
I1, PI1, I2, PI2 =  Is1[iy, iP, ip, iA], Pis1[iy, iP, ip, iA], Is2[iy, iP, ip, iA], Pis2[iy, iP, ip, iA]
Xcont = compute_cont(X, I1, PI1, I2, PI2 );
# then I use Xcont in other computations but the above line is what is slow
end
end
end
end

function compute_cont(X, I1, I2, PI1, PI2)
I1_next, I2_next = min(nY, I1+1), min(nP, I2+1);

output = @views X[:, :, I1, I2, :] * PI1 * PI2  + X[:, :, I1_next, I2, :] * (1 - PI1) * PI2
+ X[:, :, I1, I2_next, :] * PI1 * (1 - PI2) + X[:, :, I1_next, I2_next, :] * (1 - PI1) * (1 - PI2);

return output
end

``````

Every time you execute this line, you are allocating a dozen temporary arrays. Read about fusing array operations in Julia, and write into a pre-allocated `Xcont` array.

Also, put all of your loops into functions.

(Also think about your dimension ordering — ideally, order your arrays so that you are doing contiguous slices and other contiguous access, i.e. use the first dimensions for slices and inner loops.)

6 Likes

Thank you! Just a couple of points:

• what is pre-allocating `Xcont` gives an out-of-memory error?

• I try re-defining the `compute_cont` function as follows (which I think should not reallocate the temporary arrays) but it actually takes much longer… Maybe I am “fusing” the arrays incorrectly.

``````function compute_cont(X, I1, I2, PI1, PI2)
I1_next, I2_next = min(nY, I1+1), min(nP, I2+1);
output = zeros(size(X, 1), size(X, 2), size(X, 5));
for i=1:size(X, 1)
for j=1:size(X, 2)
for k=1:size(X, 5)
output = @views X[i, j, I1, I2, k] * PI1 * PI2  + X[i, j, I1_next, I2, k] * (1 - PI1) * PI2
+ X[i, j, I1, I2_next, k] * PI1 * (1 - PI2) + X[i, j, I1_next, I2_next, k] * (1 - PI1) * (1 - PI2);
end
end
end

return output
end

``````
``````  output[i, j, k] =
``````

Yes, sorry: I have that in the code (just a mistake reporting it here).

Allocate this outside this function, and pass it as a parameter. Allocate just one per thread.

(Also invert the order of the loops, run over i first, then j, then k)

can you provide a minimal working example (MWE)?

Like providing the correct datatypes and shapes for `X, I1, ...`

Hello,

Yes: this should do it. Thank you in advance for your help!

``````Is1 = rand(1:3,3,3,3,5);
Pis1 = rand(3,3,3,5);
Is2 = rand(1:3,3,3,3,5);
Pis2 = rand(3,3,3,5);

X = rand(3,3,3,3,5);

function compute_cont(X, I1, I2, PI1, PI2)
I1_next, I2_next = min(3, I1+1), min(3, I2+1);

output = @views X[:, :, I1, I2, :] * PI1 * PI2  + X[:, :, I1_next, I2, :] * (1 - PI1) * PI2  + X[:, :, I1, I2_next, :] * PI1 * (1 - PI2) + X[:, :, I1_next, I2_next, :] * (1 - PI1) * (1 - PI2);

return output
end

for i = 1:3
for j = 1:3
for k = 1:3
for w = 1:5
# relevant continuation value
I1, PI1, I2, PI2 =  Is1[i,j,k,w], Pis1[i,j,k,w], Is2[i,j,k,w], Pis2[i,j,k,w];
Xcont = compute_cont(X, I1, I2, PI1, PI2);
# some other operations occur on Xcont
end
end
end
end

``````

Some rewrite + minimal optimization to better conform to must-read: Performance Tips · The Julia Language

``````Is1 = rand(1:3,3,3,3,5);
Pis1 = rand(3,3,3,5);
Is2 = rand(1:3,3,3,3,5);
Pis2 = rand(3,3,3,5);

X = rand(3,3,3,3,5);

function compute_cont!(Xcont, X, I1, I2, PI1, PI2)
I1_next, I2_next = min(3, I1+1), min(3, I2+1);

@. Xcont = @views X[:, :, I1, I2, :] * PI1 * PI2  + X[:, :, I1_next, I2, :] * (1 - PI1) * PI2  + X[:, :, I1, I2_next, :] * PI1 * (1 - PI2) + X[:, :, I1_next, I2_next, :] * (1 - PI1) * (1 - PI2);

return Xcont
end

function do_thing(X, Is1, Pis1, Is2,Pis2)
Xcont = similar(@view X[:,:,1,1,:])
@inbounds for I in CartesianIndices(Is1)
# relevant continuation value
I1, PI1, I2, PI2 =  Is1[I], Pis1[I], Is2[I], Pis2[I];
compute_cont!(Xcont, X, I1, I2, PI1, PI2);
# some other operations occur on Xcont
end
return Xcont
end

using BenchmarkTools
@btime do_thing(X, Is1, Pis1, Is2, Pis2);
``````
1 Like

Echoing what @stevengj said, make sure to put the loops intto a function and fuse the loops. Simply changing

``````output = @views X....
``````

to

``````output = @. @views X....
``````

took the runtime on my computer from 90 ms to 21 ms, so a nearly 5x speed up by adding two characters.

(@Dan beat me to it and his in-place version is probably even faster)