Array indexing/slicing is slow

I am doing very simple calculations in loops. It looks like the array extraction is the most expensive part of this. Is there an obvious way to speed it up? Here is a MWE of what I am doing. Any help would be greatly appreciated!

Is1 and Is2 are matrices of indices, and Pis1 and Pis2 are matrices of probabilities. I compute the following

Threads.@threads for iA = 1:Nstates            
    @simd for iy = 1:nY
        @simd for iP = 1:nP
            @simd for ip = 1:np
                # relevant continuation value
                I1, PI1, I2, PI2 =  Is1[iy, iP, ip, iA], Pis1[iy, iP, ip, iA], Is2[iy, iP, ip, iA], Pis2[iy, iP, ip, iA]
                Xcont = compute_cont(X, I1, PI1, I2, PI2 );
                # then I use Xcont in other computations but the above line is what is slow
            end
        end
    end
end


function compute_cont(X, I1, I2, PI1, PI2)
    I1_next, I2_next = min(nY, I1+1), min(nP, I2+1);
    
    output = @views X[:, :, I1, I2, :] * PI1 * PI2  + X[:, :, I1_next, I2, :] * (1 - PI1) * PI2 
                    + X[:, :, I1, I2_next, :] * PI1 * (1 - PI2) + X[:, :, I1_next, I2_next, :] * (1 - PI1) * (1 - PI2);
    
    return output
end

Every time you execute this line, you are allocating a dozen temporary arrays. Read about fusing array operations in Julia, and write into a pre-allocated Xcont array.

Also, put all of your loops into functions.

(Also think about your dimension ordering — ideally, order your arrays so that you are doing contiguous slices and other contiguous access, i.e. use the first dimensions for slices and inner loops.)

6 Likes

Thank you! Just a couple of points:

  • what is pre-allocating Xcont gives an out-of-memory error?

  • I try re-defining the compute_cont function as follows (which I think should not reallocate the temporary arrays) but it actually takes much longer… Maybe I am “fusing” the arrays incorrectly.

function compute_cont(X, I1, I2, PI1, PI2)
    I1_next, I2_next = min(nY, I1+1), min(nP, I2+1);
    output = zeros(size(X, 1), size(X, 2), size(X, 5));
    for i=1:size(X, 1)
        for j=1:size(X, 2)
            for k=1:size(X, 5)
                output = @views X[i, j, I1, I2, k] * PI1 * PI2  + X[i, j, I1_next, I2, k] * (1 - PI1) * PI2 
                                + X[i, j, I1, I2_next, k] * PI1 * (1 - PI2) + X[i, j, I1_next, I2_next, k] * (1 - PI1) * (1 - PI2);
            end
        end
    end
    
    return output
end

  output[i, j, k] = 

Yes, sorry: I have that in the code (just a mistake reporting it here).

Allocate this outside this function, and pass it as a parameter. Allocate just one per thread.

(Also invert the order of the loops, run over i first, then j, then k)

can you provide a minimal working example (MWE)?

Like providing the correct datatypes and shapes for X, I1, ...

Hello,

Yes: this should do it. Thank you in advance for your help!

Is1 = rand(1:3,3,3,3,5);
Pis1 = rand(3,3,3,5);
Is2 = rand(1:3,3,3,3,5);
Pis2 = rand(3,3,3,5);


X = rand(3,3,3,3,5);

function compute_cont(X, I1, I2, PI1, PI2)
    I1_next, I2_next = min(3, I1+1), min(3, I2+1);
    
    output = @views X[:, :, I1, I2, :] * PI1 * PI2  + X[:, :, I1_next, I2, :] * (1 - PI1) * PI2  + X[:, :, I1, I2_next, :] * PI1 * (1 - PI2) + X[:, :, I1_next, I2_next, :] * (1 - PI1) * (1 - PI2);
    
    return output
end

for i = 1:3            
    for j = 1:3
        for k = 1:3
            for w = 1:5
                # relevant continuation value
                I1, PI1, I2, PI2 =  Is1[i,j,k,w], Pis1[i,j,k,w], Is2[i,j,k,w], Pis2[i,j,k,w];
                Xcont = compute_cont(X, I1, I2, PI1, PI2);  
                # some other operations occur on Xcont              
            end
        end
    end
end

Some rewrite + minimal optimization to better conform to must-read: Performance Tips · The Julia Language

Is1 = rand(1:3,3,3,3,5);
Pis1 = rand(3,3,3,5);
Is2 = rand(1:3,3,3,3,5);
Pis2 = rand(3,3,3,5);

X = rand(3,3,3,3,5);

function compute_cont!(Xcont, X, I1, I2, PI1, PI2)
    I1_next, I2_next = min(3, I1+1), min(3, I2+1);
    
    @. Xcont = @views X[:, :, I1, I2, :] * PI1 * PI2  + X[:, :, I1_next, I2, :] * (1 - PI1) * PI2  + X[:, :, I1, I2_next, :] * PI1 * (1 - PI2) + X[:, :, I1_next, I2_next, :] * (1 - PI1) * (1 - PI2);
    
    return Xcont
end

function do_thing(X, Is1, Pis1, Is2,Pis2)
    Xcont = similar(@view X[:,:,1,1,:])
    @inbounds for I in CartesianIndices(Is1)
        # relevant continuation value
        I1, PI1, I2, PI2 =  Is1[I], Pis1[I], Is2[I], Pis2[I];
        compute_cont!(Xcont, X, I1, I2, PI1, PI2);  
        # some other operations occur on Xcont              
    end
    return Xcont
end

using BenchmarkTools
@btime do_thing(X, Is1, Pis1, Is2, Pis2);
1 Like

Echoing what @stevengj said, make sure to put the loops intto a function and fuse the loops. Simply changing

output = @views X....

to

output = @. @views X....

took the runtime on my computer from 90 ms to 21 ms, so a nearly 5x speed up by adding two characters.

(@Dan beat me to it and his in-place version is probably even faster)