# Normalize a large matrix by row

How can I use a CUDA kernel to normalize a big matrix A by row and use the same scaling factors to a vector b?

I don’t have a lot of practical experience but I guess you could use a `Diagonal` matrix with the normalization constants in both cases. CUDA has fast linear algebra kernels (cuBLAS)

Thanks, @josuagrw
First I tried

``````function row_scale_1_kernel(
scales::CuDeviceVector{Tv},
rowPtr::CuDeviceVector{Ti},
colVal::CuDeviceVector{Ti},
nzVal::CuDeviceVector{Tv}
) where {Tv, Ti}
row = threadIdx().x + (blockIdx().x - 1) * blockDim().x
if row <= length(rowPtr) - 1 && row <= size(scales, 1)
row_start = rowPtr[row]
row_end = rowPtr[row + 1] - 1
sum = zero(Tv)
for j = row_start:row_end
sum += nzVal[j] * nzVal[j]
end
scales[row] = 1.0 / sqrt(sum)
end
return
end

function A_matrix(A_cpu::SparseArrays.SparseMatrixCSC{Float64,Int})
A = CuSparseMatrixCSR(A_cpu)
rows, cols = size(A) # A is in CSR format in GPU

scale_constants = CUDA.zeros(Float64, size(A, 1))
rowPtr = A.rowPtr
colVal = A.colVal
nzVal = A.nzVal

num_rows = size(A, 1)   #Number of rows in the matrix
threads_per_block = 256 # Number of threads per block (can be 256 or any other suitable value)
blocks = cld(num_rows, threads_per_block)   # Number of blocks needed to cover all rows

scale_constants,
rowPtr,
colVal,
nzVal,
)

D = Diagonal(scale_constants)
A = D * A

a_i =  Array(rowPtr)
a_j = Array(colVal)
a_v = Array(nzVal)
scale = Array(scale_constants)

return a_i, a_j, a_v, scale, rows, cols
end
``````

It returns the right values, but before Julia stops, I get tones of

``````WARNING: Error while freeing DeviceBuffer(37.820 KiB at 0x0000000302000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002c5), meta=nothing)
``````

I’m not qualified to comment on this pointer arithmetic.

But are you certain you cannot do this using the library functions? It looks like all you need is

1. a dense element-wise multiply to square the values
2. a sparse matrix vector multiply to sum each row

Both of these definitely exist in cuSparse and cuBLAS. I would not reinvent the wheel here…

1 Like

Does it have to be a kernel? It could easily be done via broadcast. Something like:

``````n = sqrt.(sum(abs2, A, dims=2)))
A ./= n
b ./= n
``````
1 Like

@Per

`n = sqrt.(sum(abs2, A, dims=2))` causes scalar indexing of a GPU array, so I used Kernel.

Try this code

``````using CUDA
using SparseArrays
using CUDA.CUSPARSE
using LinearAlgebra

n = 1000
A_sparse = sprand(n, n, 0.5)
A = CuSparseMatrixCSR(A_sparse)
n = sqrt.(sum(abs2, A, dims=2))
``````

Looks like support for reductions like `sum(abs2, A; dims=2)` for CUSPARSE arrays was recently implemented but not yet released: CUSPARSE: Implement out-of-place reductions (#1987) · JuliaGPU/CUDA.jl@7660edc · GitHub

So you can either wait for the next release of CUDA.jl (ping @maleadt), or repurpose some of the code removed in that commit, specifically the functions highlighted here: CUSPARSE: Implement out-of-place reductions (#1987) · JuliaGPU/CUDA.jl@7660edc · GitHub

That’s the removed code; the implementation of `mapreduce` below should be more generic.

Just FYI, there’s a couple of minor but breaking changes incoming (such as the deprecation of Julia 1.6-1.7 and CUDA 11.0-11.3), so the next release will take a while. If you desperately need a feature in a release, we can consider backporting it to the previous stable one.

1 Like

I really need that functionality because CPU calculation is very lengthy.

I just added the `lp/sum-dim-fix` branch, I could use it like

``````using CUDA
using CUDA.CUSPARSE
using SparseArrays
using LinearAlgebra

function scale!(coefficients)
A = CuSparseMatrixCSR(convert(SparseArrays.SparseMatrixCSC{Float64,Int}, coefficients))
scaling_factors = 1.0 ./ sqrt.(sum(abs2, A, dims=2))
D = Diagonal(scaling_factors)
A = D * A
return A
end

m,n = 5,6
p = 0.5
x = sprand(Float64, m, n, p)
scale!(x)
``````

But I am not sure; when I use the same function inside my code, I get the same error

``````WARNING: Error while freeing DeviceBuffer(223.113 KiB at 0x0000000302000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002c5), meta=nothing)
``````

That error is unrelated. The exception, `CUDA_ERROR_CONTEXT_IS_DESTROYED`, indicates that some memory is incorrectly being freed during process teardown. If you have an MWE, please file an issue.