Hello all,

This might be too niche, but I am doing the following sparse tensor times matrix multiplication (summing over `i, j, k`):

`A(n,m,l) = B(i,j,k) C(i,n) C(j,m) C(k,l)`

I can calculate a single element `A[n,m,l]` of the tensor A, where B is stored as a COO vector:
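(To be concrete, each entry of B carries the nonzero value plus its three indices; something roughly like this, where the field names match the `B_data.val`, `B_data.i`, etc. used below:)

```julia
# One COO entry of the sparse tensor B: the value plus its (i, j, k) indices.
struct COOEntry
    val::Float64
    i::Int
    j::Int
    k::Int
end

# B is then just a vector of these entries (example values).
B = [COOEntry(2.0, 1, 3, 2), COOEntry(-1.5, 4, 1, 1)]
```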

```
function gpu_k3_kernel(B, C_n, C_m, C_l)
    # Reduce over all nonzeros of B; C_n, C_m, C_l are the n-th, m-th,
    # and l-th columns of C.
    f = (B_data) -> B_data.val * C_n[B_data.i] * C_m[B_data.j] * C_l[B_data.k]
    return mapreduce(f, +, B)
end
```

but this has a horrible memory access pattern, since the indices `i,j,k` access `C` randomly, and the kernel quickly runs into memory bandwidth issues. Not to mention I have to re-launch the kernel for every element.

I can rewrite this kernel to do the entire multiplication at once with the code below, so that `C` is accessed in a more organized manner. The only problem is that now I am accumulating into the array `A`, which I am not sure how to handle in an easy way (I'm not an expert GPU programmer). Is there some way to do this without having to rewrite a reduce kernel? Any help would be greatly appreciated. Thanks!

```
# Accumulate every element of A in one pass over the nonzeros of B;
# only the m <= n <= o triangle is computed.
for b in eachindex(B)
    for o in 1:N_modes
        for n in 1:o
            for m in 1:n
                A[m, n, o] += B[b].val * C[B[b].i, m] * C[B[b].j, n] * C[B[b].k, o]
            end
        end
    end
end
```
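One direction I've been looking at (not sure it's the right one) is launching one thread per COO entry and letting each thread scatter its contributions into `A` with atomic adds, so no hand-written reduce kernel is needed. A rough sketch, assuming CUDA.jl's `CUDA.@atomic` and with the COO fields split into separate device arrays (`vals`, `is`, `js`, `ks` are my names, "structure of arrays" style):

```julia
using CUDA

# One thread per COO entry of B; atomic adds prevent races when two
# entries update the same A[m, n, o].
function accumulate_kernel!(A, vals, is, js, ks, C, N_modes)
    b = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if b <= length(vals)
        @inbounds for o in 1:N_modes, n in 1:o, m in 1:n
            CUDA.@atomic A[m, n, o] += vals[b] * C[is[b], m] * C[js[b], n] * C[ks[b], o]
        end
    end
    return nothing
end

# Launch over all entries of B, e.g.:
# @cuda threads=256 blocks=cld(length(vals), 256) accumulate_kernel!(A, vals, is, js, ks, C, N_modes)
```

No idea whether the atomic contention would kill performance here, but it avoids writing any reduction by hand.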