Hi. I am new to GPU programming with Julia and am trying to run a matrix multiplication inside a loop. Performance matters here, so I want to minimise allocations. But if I run the following minimal example (written the way I would do the multiplication on the CPU):

using CUDA
using LinearAlgebra # for mul!
A = CuArray(rand(999, 999))
B = CuArray(rand(1000, 1000))
C = CuArray(rand(1000, 1000))
inds = collect(1:999)
mul!(A, @view(B[inds, inds]), @view(C[inds, inds]))

I get a “performing scalar indexing” error on the final line and the operation is very slow (multiple seconds). If I instead run

using CUDA
using LinearAlgebra # for mul!
A = CuArray(rand(999, 999))
B = CuArray(rand(1000, 1000))
C = CuArray(rand(1000, 1000))
inds = collect(1:999)
mul!(A, B[inds, inds], C[inds, inds])

I do not get the error, but the code is slow due to all the allocations from slicing the arrays. Is there an efficient way to do this operation in CUDA.jl without any allocations in the multiplication step?

That’s because those views don’t result in contiguous or strided memory. We don’t seem to have the necessary methods to dispatch that kind of mul! (with two SubArray inputs) to a GPU implementation, but even if we did, we could only dispatch to our slow fallback gemm implementation. If you use contiguous or strided arrays, you should be able to use CUBLAS, which is fast.

Thanks for the quick reply! As a follow-up, is there a way to check/ensure that such arrays are contiguous in memory so I can use CUBLAS? The operation is part of a block coordinate-descent algorithm that estimates one of these matrices: I fix the types and sizes of the arrays at the beginning, and then only need to mutate one column at a time by multiplying two sub-arrays. It would therefore be great to keep everything on the GPU to speed up the calculations.

Actually, these views are strided, but that information is lost from the type because you’re indexing using vectors and not ranges. If you just omit collect, everything works:

using CUDA
using LinearAlgebra
A = CuArray(rand(999,999))
B = CuArray(rand(1000,1000))
C = CuArray(rand(1000,1000))
inds = 1:999
@views mul!(A, B[inds, inds], C[inds, inds])

Contiguous views are represented by CuArray objects; we avoid using SubArray for those, so they are easy to identify. Strided CuArrays can be identified using the StridedCuArray type alias, which is what most CUBLAS methods use for dispatch too.
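To illustrate the distinction, you can inspect the view types directly. This is a sketch assuming a CUDA-capable device; the variable names are chosen here for illustration, and `StridedCuArray` is qualified with the module name in case it isn’t exported in your CUDA.jl version:

```julia
using CUDA
using LinearAlgebra

B = CuArray(rand(1000, 1000))

# Range-based indices keep stride information in the type,
# so this view can be dispatched to CUBLAS.
v_range = @view B[1:999, 1:999]
println(v_range isa CUDA.StridedCuArray)

# Vector-based indices lose that information, so this view
# falls back to the slow generic implementation.
v_vec = @view B[collect(1:999), collect(1:999)]
println(v_vec isa CUDA.StridedCuArray)

# A contiguous view (whole leading dimension, range of columns)
# is represented as a plain CuArray rather than a SubArray.
v_contig = @view B[:, 1:500]
println(v_contig isa CuArray)
```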

Thanks so much! A related question, then: is there a way to index in this fashion while leaving one column out for each j? Say I need to iterate j from 1 to 1000, leaving out column j each time.

No, that wouldn’t be strided in general. I suppose you could split it into two matmuls, one for columns 1:(j - 1) and one for columns (j + 1):1000. Use the 5-argument version of mul! to add the second to the first. I don’t know exactly what slices you need to take, but it might look something like this:

@views begin
    mul!(y, M[:, 1:(j - 1)], x[1:(j - 1)])
    mul!(y, M[:, (j + 1):end], x[(j + 1):end], 1, 1) # Call mul! like this to add to the values already in `y`
end
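Put together in a loop, a leave-one-column-out sweep might look like the sketch below. The names `M`, `x`, and `y` are placeholders (a matrix, a vector, and a preallocated output); at j = 1 and j = size(M, 2) one of the two views is empty, which the multiplications handle gracefully:

```julia
using CUDA
using LinearAlgebra

M = CuArray(rand(1000, 1000))
x = CuArray(rand(1000))
y = CUDA.zeros(Float64, 1000)  # preallocated output, reused every iteration

for j in 1:size(M, 2)
    @views begin
        # y = M[:, 1:j-1] * x[1:j-1] (overwrites y; empty when j == 1)
        mul!(y, M[:, 1:(j - 1)], x[1:(j - 1)])
        # y += M[:, j+1:end] * x[j+1:end] via 5-argument mul!
        mul!(y, M[:, (j + 1):end], x[(j + 1):end], 1, 1)
    end
    # ...use y to update column j of the estimate here...
end
```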