Hello all,
This might be too niche, but I am doing the following sparse tensor times matrix multiplication:
A(n,m,l) = B(i,j,k) C(i,n) C(j,m) C(k,l)
with summation over the repeated indices i, j, k.
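For concreteness, here is a plain (dense, CPU) reference of the contraction I mean; the names and sizes are made up, and B_dense is the dense version of my sparse B:

```julia
# Dense reference: A[n,m,l] = sum_{i,j,k} B[i,j,k] * C[i,n] * C[j,m] * C[k,l]
# C has rows indexed by the "physical" index (i/j/k) and columns by the mode (n/m/l).
function contract_dense(B_dense, C)
    I, N = size(C)
    A = zeros(N, N, N)
    for n in 1:N, m in 1:N, l in 1:N
        s = 0.0
        for i in 1:I, j in 1:I, k in 1:I
            s += B_dense[i, j, k] * C[i, n] * C[j, m] * C[k, l]
        end
        A[n, m, l] = s
    end
    return A
end
```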
I can calculate a single element A[n,m,l] of the tensor A, where B is stored as a COO vector:
function gpu_k3_kernel(B, C_n, C_m, C_l)
    # C_n, C_m, C_l are the columns of C for the fixed output indices n, m, l
    f = B_data -> B_data.val * C_n[B_data.i] * C_m[B_data.j] * C_l[B_data.k]
    return mapreduce(f, +, B)
end
but this has a horrible memory access pattern, since the indices i, j, k access C essentially at random, so the kernel quickly runs into memory bandwidth issues. Not to mention I have to re-launch this kernel for every element of A.
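(For what it's worth, one partial mitigation I considered is sorting the COO entries so that consecutive entries, and hence consecutive threads, read nearby rows of C. A sketch, assuming B is a Vector of NamedTuples with fields val, i, j, k:)

```julia
# Sort the COO entries lexicographically by (i, j, k) so that consecutive
# entries access nearby rows of C; this does not change the reduction result.
sort_coo(B) = B[sortperm(B; by = b -> (b.i, b.j, b.k))]
```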
I can rewrite this kernel to do the entire multiplication at once with the code below, so that C is accessed in a more organized manner. The only problem is that now I am accumulating into the array A, which I am not really sure how to handle in an easy way (I'm not an expert GPU programmer). Is there some way to do this without having to write my own reduce kernel? Any help would be greatly appreciated. Thanks!
for b in eachindex(B)
    for o in 1:N_modes
        for n in 1:o
            for m in 1:n
                A[m, n, o] += B[b].val * C[B[b].i, m] * C[B[b].j, n] * C[B[b].k, o]
            end
        end
    end
end
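The best I could come up with myself is atomic accumulation, one thread per COO entry, something like the untested CUDA.jl sketch below (array-of-fields layout assumed), but I don't know whether atomics are a sensible approach here:

```julia
using CUDA

# One thread per COO entry; each thread atomically accumulates its
# contribution into every (m, n, o) slot of A. Untested sketch.
function coo_contract_kernel!(A, vals, is, js, ks, C, N_modes)
    b = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if b <= length(vals)
        i, j, k, v = is[b], js[b], ks[b], vals[b]
        for o in 1:N_modes, n in 1:o, m in 1:n
            CUDA.@atomic A[m, n, o] += v * C[i, m] * C[j, n] * C[k, o]
        end
    end
    return nothing
end
```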