I want to compute the following:
result = -1 * (e * (e' * U)) + M * U
where e = ones(n,1), M is an n x n CuSparseMatrixCSR, and U is an n x r dense CuMatrix.
My understanding is that the expression above launches four kernels. I am trying to reduce that to three by using a function that performs the product and the sum in a single call, like the SpMM function in the CUDA libraries. Something like the function below:
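using CUDA, CUDA.CUSPARSE, LinearAlgebra

function grad_function(e::CuArray, U::CuArray, M::CuSparseMatrixCSR)
    # -e * (e' * U): with e all ones, every row is the column sums of U
    out = -e .* sum(U, dims=1)
    alpha = 1.0
    beta = 1.0
    mul!(out, M, U, alpha, beta) # out = alpha*(M*U) + beta*out
    return out
end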
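As a sanity check on the first line of the function, here is why I think the broadcasted column sum equals the rank-one term. This assumes e is exactly the all-ones vector of length n; the indices i, j, k are only notation I introduce here.

```latex
\[
(e^{\top} U)_{j} = \sum_{i=1}^{n} U_{ij},
\qquad
\bigl(e\,(e^{\top} U)\bigr)_{kj} = \sum_{i=1}^{n} U_{ij}
\quad \text{for every row } k = 1,\dots,n,
\]
\[
\text{result} = M U - e\,(e^{\top} U).
\]
```

So every row of e * (e' * U) is the same 1 x r vector of column sums, which is what `-e .* sum(U, dims=1)` builds before the sparse product is accumulated into it.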
Is it possible to do this with CUDA.jl?