How do I reduce into an array index on a GPU?

Hello all,

This might be too niche, but I am doing the following sparse tensor times matrix multiplication (summing over the repeated indices i, j, k):
A(n,m,l) = B(i,j,k) C(i,n) C(j,m) C(k,l)

I can calculate a single element A[n,m,l] of the tensor A, where B is stored as a vector of COO entries:

function gpu_k3_kernel(B, C_n, C_m, C_l)
    # map over the nonzeros of B and sum their contributions to one element of A
    f = (B_data) -> B_data.val * C_n[B_data.i] * C_m[B_data.j] * C_l[B_data.k]
    return mapreduce(f, +, B)
end
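
For context, filling all of A means calling this once per element, roughly like the sketch below (my own conventions here, nothing from a library: C is assumed to be a matrix with one column per mode, and A is a host array that just collects the scalar results):

N_modes = size(C, 2)
A = zeros(N_modes, N_modes, N_modes)
for l in 1:N_modes, m in 1:N_modes, n in 1:N_modes
    # one mapreduce launch per element of A
    A[n, m, l] = gpu_k3_kernel(B, view(C, :, n), view(C, :, m), view(C, :, l))
end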

but this has a horrible memory access pattern, since the indices i, j, k access C essentially at random, so the kernel runs into memory bandwidth issues quickly. Not to mention I have to re-launch this kernel for every element of A.

I can rewrite this kernel to do the entire multiplication at once with the code below, so that C is accessed in a more organized manner. The only problem is that now I am accumulating into the array A, which I am not really sure how to handle in an easy way (I'm not an expert GPU programmer). Is there some way to do this without having to write my own reduce kernel? Any help would be greatly appreciated. Thanks!

for b in eachindex(B)
    for o in 1:N_modes
        for n in 1:o
            for m in 1:n
                A[m, n, o] += B[b].val * C[B[b].i, m] * C[B[b].j, n] * C[B[b].k, o]
            end
        end
    end
end
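
In case it makes the question clearer, the closest thing I can picture is launching one thread per nonzero of B and accumulating into A with atomic adds, something like the sketch below (the B_val/B_i/B_j/B_k arrays and the kernel name are just placeholders for however B ends up laid out on the device; I genuinely don't know if this is a sensible approach, which is really what I'm asking):

using CUDA

function k3_accum_kernel!(A, B_val, B_i, B_j, B_k, C, N_modes)
    # one thread per nonzero of B
    b = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if b <= length(B_val)
        i, j, k = B_i[b], B_j[b], B_k[b]
        v = B_val[b]
        for o in 1:N_modes, n in 1:o, m in 1:n
            # different threads can hit the same A[m, n, o], so the add has to be atomic
            CUDA.@atomic A[m, n, o] += v * C[i, m] * C[j, n] * C[k, o]
        end
    end
    return nothing
end

# launched with something like
# threads = 256
# blocks  = cld(length(B_val), threads)
# @cuda threads=threads blocks=blocks k3_accum_kernel!(A, B_val, B_i, B_j, B_k, C, N_modes)

But I have no feel for whether the contention on the atomics would kill performance, or whether there is a cleaner way to get this accumulation with the existing mapreduce-style machinery.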