Applying `batched_mul` from NNlibCUDA to a `CuArray` tensor and a `CuSparseMatrix` falls back to the generic method and performs scalar indexing. Is there a GPU-efficient way to broadcast multiplication of a `CuSparseMatrix` over a `CuArray` tensor?
P.S. There is currently an issue where `batched_mul` causes the output to be a CPU array, but even fixing that does not seem to solve the scalar-indexing problem.
```julia
using CUDA, CUDA.CUSPARSE, NNlib, NNlibCUDA

CUDA.allowscalar(false)

a = CuArray(ones(2, 2, 3))
b = CuSparseMatrixCSR(CuArray(ones(2, 2)))  # a concrete sparse type (CuSparseMatrix is abstract)
c = batched_mul(a, b)  # this does scalar indexing
```
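For reference, the only workaround I have so far is densifying the sparse factor before the batched multiply, which stays on the GPU but gives up the memory savings of sparsity (a sketch, assuming the `CuMatrix` conversion from the sparse type is available):

```julia
using CUDA, CUDA.CUSPARSE, NNlib, NNlibCUDA

CUDA.allowscalar(false)

a = CuArray(ones(2, 2, 3))
b = CuSparseMatrixCSR(CuArray(ones(2, 2)))

# Convert the sparse matrix to a dense CuMatrix so batched_mul
# dispatches to the dense GPU batched-gemm path instead of the
# generic fallback. No scalar indexing, but sparsity is forfeited.
bd = CuMatrix(b)
c = batched_mul(a, bd)
```

This obviously does not scale to large, very sparse matrices, which is why I'm looking for a CUSPARSE-backed alternative.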