Parallelization of for loop

Dear Julia users,

I am trying to parallelize the following loop, what do you think is the best approach?

         for n = 1:N
             kron!(temp,A[n,:],B[n,:])
             C.+= temp.*temp'
             Cx .+= temp.*x[n]
         end

I tried using FLoops, but the outputs differ, and the code is way slower:

        for n = 1:N
            kron!(temp,A[n,:],B[n,:])
            @reduce(C= zeros(M,M) + temp*temp', Cx = zeros(M) + temp*x[n])
        end   

I am stuck and would appreciate any kind of help!

Generally, before you parallelize a code, you want to make sure that the serial performance is decent. In particular, you want to avoid unnecessary allocations. Matrix slices allocate in Julia by default, so you want to put @views in front of the kron!.

Also you might want to transpose A and B, since operating on a sliced column should be faster than operating on a sliced row. Like Fortran/MATLAB and in contrast to C/Python, Julia is column major order.

2 Likes

The difficulty with parallelising the loop over n is that you need somewhere to accumulate the results of each thread, to combine at the end.

I think it’ll work better to do the other loops in parallel. And to make the one that is summing the innermost. In which case this layout of the arrays probably isn’t a bad thing. But you will want to avoid making all these slices, and deal directly with the elements.

Here’s one way:

N = 5; M = 2;
A = rand(N,M); B = rand(N,M); x = rand(N); 
C = zeros(M^2, M^2); Cx = zeros(M^2); temp = rand(M^2);

function orig!(C,Cx,A,B,temp)
    for n in 1:size(A,1)
        @views kron!(temp,A[n,:],B[n,:])
        C.+= temp.*temp'
        Cx .+= temp.*x[n]
    end
end
orig!(C,Cx,A,B,temp)

using Tullio  # should parallelise, almost 4x quicker for me
# using LoopVectoriztion  # another 3x or so speed up
@tullio C4[j,i,l,k] := A[n,i] * B[n,j] * A[n,k] * B[n,l];
@tullio Cx4[j,i] := A[n,i] * B[n,j] * x[n];

reshape(C4, size(C)) ≈ C
reshape(Cx4, size(Cx)) ≈ Cx

function alt!(C,Cx,A,B)
    C4 = reshape(C, M,M,M,M)  # writing into original C
    Cx4 = reshape(Cx, M,M)
    @tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
    @tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
    nothing
end
3 Likes

Amazing, thanks a lot!:slight_smile: