First CUDA Program - Addressing Scalar Indexing

Hello, I am trying to get my feet wet in the world of CUDA programming. I started with some CPU code that I had previously written, but I have been having trouble with scalar indexing slowing down the computation. The problem stems from this line: x = v + (A_GPU' * (Vector_GPU - A_GPU * v)). I believe it’s due to the addition, but I wasn’t sure. I was hoping someone might have recommendations on how to optimize this calculation, or could point me to documentation or examples that would help me understand it better.

I’ve included some sample code below showing my initial implementation.

using CUDA, SparseArrays
using CUDA.CUSPARSE

if has_cuda_gpu()
    CUDA.allowscalar(false)
end

function GPU_test(A_GPU, Vector_GPU, niter)
    @views @inbounds begin
        v = CUDA.zeros(size(A_GPU,2))
        x_old = CUDA.zeros(size(A_GPU,2))
        for iter = 1:niter
            x = v + (A_GPU' * (Vector_GPU - A_GPU * v))
            v = x + 0.5 * (x - x_old)
            x_old = x           
        end
    end
    return x
end

function GPU_init(A_CPU,Vector_CPU, n, niter_GPU)
    A_GPU = CuSparseMatrixCSR(A_CPU)
    Vector_GPU = CuArray{ComplexF32}(undef,size(Vector_CPU))
    copyto!(Vector_GPU, Vector_CPU)
    return GPU_test(A_GPU, Vector_GPU, niter_GPU)
end
## ---
m = 800
n = 2500
d = 0.75
niter_GPU = 100

A_CPU = sprand(ComplexF32,m,n,d)
Vector_CPU = zeros(ComplexF32,m)

out = GPU_init(A_CPU,Vector_CPU, n, niter_GPU)

You should look at the backtrace of the error message:

julia> out = GPU_init(A_CPU,Vector_CPU, n, niter_GPU)
ERROR: Scalar indexing is disallowed.
...
Stacktrace:
...
  [7] generic_matvecmul!(C::CuArray{ComplexF32, 1}, tA::Char, A::CuSparseMatrixCSR{ComplexF32}, B::CuArray{Float32, 1}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:759
  [8] mul!
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:81 [inlined]
  [9] mul!
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:275 [inlined]
 [10] *(A::CuSparseMatrixCSR{ComplexF32}, x::CuArray{Float32, 1})
    @ LinearAlgebra /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:51
 [11] GPU_test(A_GPU::CuSparseMatrixCSR{ComplexF32}, Vector_GPU::CuArray{ComplexF32, 1}, niter::Int64)
    @ Main ./REPL[5]:6

It’s clear here that this is triggered by *: you’re doing sparse × dense multiplication with mismatched element types (a ComplexF32 matrix times a Float32 vector), which falls back to the generic Base implementation. For sparse operations we only use CUSPARSE (i.e. we don’t have a natively-implemented generic fallback), so you need to stick to what CUSPARSE supports.

Thank you for pointing that out! Though looking through the CUSPARSE.jl documentation, it appears that it should support sparse × dense multiplication. Are you saying that the issue stems from the type difference, i.e. ComplexF32 vs Float32?

Yes. It’s possible CUSPARSE nowadays has generic APIs that allow mixing types (happy to take a PR for this), but the old APIs used to be pretty constrained.
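
In your case, the easiest fix is to make the dense vectors match the element type of the sparse matrix. CUDA.zeros without an explicit element type defaults to Float32 (that’s where the CuArray{Float32, 1} in frame [7] of the stack trace comes from), so something along these lines (an untested sketch) should route the multiplication through CUSPARSE:

v     = CUDA.zeros(ComplexF32, size(A_GPU, 2))   # match eltype(A_GPU)
x_old = CUDA.zeros(ComplexF32, size(A_GPU, 2))

or, more generically, CUDA.zeros(eltype(A_GPU), size(A_GPU, 2)).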

Thank you for the help and suggestion! Addressing the type issue eliminated the error, and along the way I also spotted several ways to improve this basic algorithm.
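
Here is a rough sketch of one of them: reusing preallocated buffers with in-place mul! and broadcasting instead of allocating new arrays every iteration. This assumes mul! dispatches to CUSPARSE for both A_GPU and A_GPU'; the buffer name r is just mine, and I haven’t benchmarked it yet:

using LinearAlgebra

function GPU_test!(A_GPU, Vector_GPU, niter)
    T = eltype(A_GPU)
    v     = CUDA.zeros(T, size(A_GPU, 2))
    x     = CUDA.zeros(T, size(A_GPU, 2))
    x_old = CUDA.zeros(T, size(A_GPU, 2))
    r     = CUDA.zeros(T, size(A_GPU, 1))   # residual buffer of length m
    for iter = 1:niter
        mul!(r, A_GPU, v)                   # r = A * v
        r .= Vector_GPU .- r                # r = b - A * v
        mul!(x, A_GPU', r)                  # x = A' * (b - A * v)
        x .+= v                             # x = v + A' * (b - A * v)
        v .= x .+ T(0.5) .* (x .- x_old)    # extrapolation step
        x_old .= x
    end
    return x
end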

I was wondering if you had any recommendations on resources or examples that would be good to learn from? I really like how simple it can be to perform a calculation on a GPU, but I would like to dive deeper into GPU programming, specifically writing my own kernels and such.

There’s a great introductory tutorial here: Introduction · CUDA.jl
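
And just to give you a taste of what that tutorial builds up to: a hand-written kernel isn’t much more code than the array version. A minimal (untested) vector-add kernel might look like this; the tutorial walks through the same pattern step by step:

function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x   # global thread index
    if i <= length(c)                                       # guard against out-of-bounds threads
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing                                          # kernels must return nothing
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)
@cuda threads=256 blocks=cld(length(a), 256) vadd!(c, a, b)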