Slicing a CuSparseMatrixCSC matrix

I have a question about slicing a sparse matrix on the GPU. I don't understand why the result matrix only starts being assigned values from its second row. This is an unexpected result for me. How can I fix the error?

using CUDA
using CUDA.CUSPARSE
using SparseArrays
using Printf

a = [1,2,3]
b = [1,2,3]
n = 10  # size of the matrix
density = 0.7  # density of the non-zero elements
sparse_matrix = sprand(n, n, density)
na = length(a)
nb = length(b)
a_gpu = CUDA.CuArray(a)
b_gpu = CUDA.CuArray(b)
result = CUDA.zeros(Float32, na, nb)
# Convert to CuSparseMatrixCSC
sparse_matrix_gpu = CUDA.CUSPARSE.CuSparseMatrixCSC(sparse_matrix)
# The column pointers
col = sparse_matrix_gpu.colPtr
# The row indices
row = sparse_matrix_gpu.rowVal
# The nonzero values
val = sparse_matrix_gpu.nzVal

function kernel_index!(result, row_indices, col_ptrs, val, a, b, na, nb)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x + 1
    if i <= na
        for j in 1:nb
            row = a[i] - 1
            col = b[j]
            value = 0.0
            for k in col_ptrs[col]:(col_ptrs[col+1] - 1)
                if row_indices[k] == row
                    value = val[k]
                    break
                end
            end
            result[i, j] = value
        end
    end
    return
end
threads_per_block = 256
blocks_per_grid = ceil(Int, na / threads_per_block)
@cuda threads=threads_per_block blocks=blocks_per_grid kernel_index!(result, row, col, val, a_gpu, b_gpu, na, nb)

The result matrix is:

julia> result
3×3 CuArray{Float32, 2, CUDA.DeviceMemory}:
 0.0       0.0       0.0
 0.0       0.248917  0.263072
 0.128449  0.0       0.620226

It only stores sparse_matrix_gpu[1:2, 1:3] and starts assigning values from the second row of the result matrix.

Hi, and welcome to the Julia community!

In your index computation

i = threadIdx().x + (blockIdx().x - 1) * blockDim().x + 1

you correctly noted that blockIdx().x starts at 1, so you need to subtract 1 in the offset calculation. But threadIdx().x also begins at 1, meaning that to end up with a 1-based i you don't need to subtract (or add) anything further. So the correct code would be

i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
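
With this change you will presumably also want row = a[i] instead of row = a[i] - 1 in the kernel: CuSparseMatrixCSC keeps SparseArrays' 1-based row indices in rowVal, so there is no offset to undo there either. Putting both fixes together, the kernel could look like this (an untested sketch with the same signature as yours, not a definitive version):

function kernel_index!(result, row_indices, col_ptrs, val, a, b, na, nb)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= na
        for j in 1:nb
            row = a[i]  # rowVal is 1-based, so no -1 here
            col = b[j]
            value = zero(eltype(val))
            # scan column `col` of the CSC structure for an entry in row `row`
            for k in col_ptrs[col]:(col_ptrs[col+1] - 1)
                if row_indices[k] == row
                    value = val[k]
                    break
                end
            end
            result[i, j] = value
        end
    end
    return
end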

In general these index manipulations can be confusing, so I would advise manually trying out the boundary cases. E.g. the ‘first’ thread has threadIdx().x == 1 == blockIdx().x and should end up with i = 1.
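
You can replicate the index formula in ordinary CPU code to check such boundary cases; a minimal sketch (the helper linear_index is made up for illustration, with blockDim().x assumed to be 256):

# CPU replica of the in-kernel index computation
linear_index(threadidx, blockidx, blockdim) = threadidx + (blockidx - 1) * blockdim

@assert linear_index(1, 1, 256) == 1      # first thread of the first block
@assert linear_index(256, 1, 256) == 256  # last thread of the first block
@assert linear_index(1, 2, 256) == 257    # first thread of the second block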

As a side-note, the literal 1 is of type Int64, which you typically want to avoid on the GPU. Instead you should use a 32-bit version, e.g. 1i32, after importing it via using CUDA: i32. See the Performance Tips in the CUDA.jl documentation for more information.
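
For example, a minimal self-contained kernel using the 1i32 literal (a sketch; the kernel demo! and its body are just for illustration):

using CUDA
using CUDA: i32

function demo!(out)
    # 1i32 keeps the index arithmetic in Int32 instead of promoting it to Int64
    i = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x
    if i <= length(out)
        out[i] = i
    end
    return
end

out = CUDA.zeros(Int32, 10)
@cuda threads=4 blocks=3 demo!(out)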

Thank you very much for your help. It is very helpful for improving efficiency.
