How can I use a CUDA kernel to normalize a big matrix A by row and use the same scaling factors to a vector b?

I don’t have a lot of practical experience but I guess you could use a `Diagonal`

matrix with the normalization constants in both cases. CUDA has fast linear algebra kernels (cuBLAS)

Thanks, @josuagrw

First I tried

```
function row_scale_1_kernel(
scales::CuDeviceVector{Tv},
rowPtr::CuDeviceVector{Ti},
colVal::CuDeviceVector{Ti},
nzVal::CuDeviceVector{Tv}
) where {Tv, Ti}
row = threadIdx().x + (blockIdx().x - 1) * blockDim().x
if row <= length(rowPtr) - 1 && row <= size(scales, 1)
row_start = rowPtr[row]
row_end = rowPtr[row + 1] - 1
sum = zero(Tv)
for j = row_start:row_end
sum += nzVal[j] * nzVal[j]
end
scales[row] = 1.0 / sqrt(sum)
end
return
end
function A_matrix(A_cpu::SparseArrays.SparseMatrixCSC{Float64,Int})
A = CuSparseMatrixCSR(A_cpu)
rows, cols = size(A) # A is in CSR format in GPU
scale_constants = CUDA.zeros(Float64, size(A, 1))
rowPtr = A.rowPtr
colVal = A.colVal
nzVal = A.nzVal
num_rows = size(A, 1) #Number of rows in the matrix
threads_per_block = 256 # Number of threads per block (can be 256 or any other suitable value)
blocks = cld(num_rows, threads_per_block) # Number of blocks needed to cover all rows
@cuda threads=threads_per_block blocks=blocks row_scale_1_kernel(
scale_constants,
rowPtr,
colVal,
nzVal,
)
D = Diagonal(scale_constants)
A = D * A
a_i = Array(rowPtr)
a_j = Array(colVal)
a_v = Array(nzVal)
scale = Array(scale_constants)
return a_i, a_j, a_v, scale, rows, cols
end
```

It returns the right values, but before Julia stops, I get tones of

```
WARNING: Error while freeing DeviceBuffer(37.820 KiB at 0x0000000302000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002c5), meta=nothing)
```

I’m not qualified to comment on this pointer arithmetic.

But are you certain you cannot do this using the library functions? It looks like all you need is

- a dense element-wise multiply to square the values
- a sparse matrix vector multiply to sum each row

Both of these definitely exist in cuSparse and cuBLAS. I would not reinvent the wheel here…

Does it have to be a kernel? It could easily be done via broadcast. Something like:

```
n = sqrt.(sum(abs2, A, dims=2)))
A ./= n
b ./= n
```

`n = sqrt.(sum(abs2, A, dims=2))`

causes scalar indexing of a GPU array, so I used Kernel.

Try this code

```
using CUDA
using SparseArrays
using CUDA.CUSPARSE
using LinearAlgebra
n = 1000
A_sparse = sprand(n, n, 0.5)
A = CuSparseMatrixCSR(A_sparse)
n = sqrt.(sum(abs2, A, dims=2))
```

Looks like support for reductions like `sum(abs2, A; dims=2)`

for CUSPARSE arrays was recently implemented but not yet released: CUSPARSE: Implement out-of-place reductions (#1987) · JuliaGPU/CUDA.jl@7660edc · GitHub

So you can either wait for the next release of CUDA.jl (ping @maleadt), or repurpose some of the code removed in that commit, specifically the functions highlighted here: CUSPARSE: Implement out-of-place reductions (#1987) · JuliaGPU/CUDA.jl@7660edc · GitHub

That’s the removed code; the implementation of `mapreduce`

below should be more generic.

Just FYI, there’s a couple of minor but breaking changes incoming (such as the deprecation of Julia 1.6-1.7 and CUDA 11.0-11.3), so the next release will take a while. If you desperately need a feature in a release, we can consider backporting it to the previous stable one.

I just added the `lp/sum-dim-fix`

branch, I could use it like

```
using CUDA
using CUDA.CUSPARSE
using SparseArrays
using LinearAlgebra
function scale!(coefficients)
A = CuSparseMatrixCSR(convert(SparseArrays.SparseMatrixCSC{Float64,Int}, coefficients))
scaling_factors = 1.0 ./ sqrt.(sum(abs2, A, dims=2))
D = Diagonal(scaling_factors)
A = D * A
return A
end
m,n = 5,6
p = 0.5
x = sprand(Float64, m, n, p)
scale!(x)
```

But I am not sure; when I use the same function inside my code, I get the same error

```
WARNING: Error while freeing DeviceBuffer(223.113 KiB at 0x0000000302000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002c5), meta=nothing)
```

That error is unrelated. The exception, `CUDA_ERROR_CONTEXT_IS_DESTROYED`

, indicates that some memory is incorrectly being freed during process teardown. If you have an MWE, please file an issue.