The difficulty with parallelising the loop over n
is that you need somewhere to accumulate the results of each thread, to combine at the end.
I think it’ll work better to do the other loops in parallel. And to make the one that is summing the innermost. In which case this layout of the arrays probably isn’t a bad thing. But you will want to avoid making all these slices, and deal directly with the elements.
Here’s one way:
N = 5; M = 2;
A = rand(N,M); B = rand(N,M); x = rand(N);
C = zeros(M^2, M^2); Cx = zeros(M^2); temp = rand(M^2);
function orig!(C,Cx,A,B,temp)
for n in 1:size(A,1)
@views kron!(temp,A[n,:],B[n,:])
C.+= temp.*temp'
Cx .+= temp.*x[n]
end
end
orig!(C,Cx,A,B,temp)
using Tullio # should parallelise, almost 4x quicker for me
# using LoopVectoriztion # another 3x or so speed up
@tullio C4[j,i,l,k] := A[n,i] * B[n,j] * A[n,k] * B[n,l];
@tullio Cx4[j,i] := A[n,i] * B[n,j] * x[n];
reshape(C4, size(C)) ≈ C
reshape(Cx4, size(Cx)) ≈ Cx
function alt!(C,Cx,A,B)
C4 = reshape(C, M,M,M,M) # writing into original C
Cx4 = reshape(Cx, M,M)
@tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
@tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
nothing
end