Normalize a large matrix by row

How can I use a CUDA kernel to normalize a big matrix A by row and use the same scaling factors to a vector b?

I don’t have a lot of practical experience but I guess you could use a Diagonal matrix with the normalization constants in both cases. CUDA has fast linear algebra kernels (cuBLAS)

Thanks, @josuagrw
First I tried

function row_scale_1_kernel(
) where {Tv, Ti}
    row = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if row <= length(rowPtr) - 1 && row <= size(scales, 1)
        row_start = rowPtr[row]
        row_end = rowPtr[row + 1] - 1
        sum = zero(Tv)
        for j = row_start:row_end
            sum += nzVal[j] * nzVal[j]
        scales[row] = 1.0 / sqrt(sum)

function A_matrix(A_cpu::SparseArrays.SparseMatrixCSC{Float64,Int})
    A = CuSparseMatrixCSR(A_cpu)
    rows, cols = size(A) # A is in CSR format in GPU

    scale_constants = CUDA.zeros(Float64, size(A, 1))
    rowPtr = A.rowPtr
    colVal = A.colVal
    nzVal = A.nzVal

    num_rows = size(A, 1)   #Number of rows in the matrix
    threads_per_block = 256 # Number of threads per block (can be 256 or any other suitable value)
    blocks = cld(num_rows, threads_per_block)   # Number of blocks needed to cover all rows
    @cuda threads=threads_per_block blocks=blocks row_scale_1_kernel(
    D = Diagonal(scale_constants)
    A = D * A

    a_i =  Array(rowPtr)
    a_j = Array(colVal)
    a_v = Array(nzVal)
    scale = Array(scale_constants)

    return a_i, a_j, a_v, scale, rows, cols

It returns the right values, but before Julia stops, I get tones of

WARNING: Error while freeing DeviceBuffer(37.820 KiB at 0x0000000302000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002c5), meta=nothing)

I’m not qualified to comment on this pointer arithmetic.

But are you certain you cannot do this using the library functions? It looks like all you need is

  1. a dense element-wise multiply to square the values
  2. a sparse matrix vector multiply to sum each row

Both of these definitely exist in cuSparse and cuBLAS. I would not reinvent the wheel here…

1 Like

Does it have to be a kernel? It could easily be done via broadcast. Something like:

n = sqrt.(sum(abs2, A, dims=2)))
A ./= n
b ./= n
1 Like


n = sqrt.(sum(abs2, A, dims=2)) causes scalar indexing of a GPU array, so I used Kernel.

Try this code

using CUDA
using SparseArrays
using LinearAlgebra

n = 1000
A_sparse = sprand(n, n, 0.5)
A = CuSparseMatrixCSR(A_sparse)
n = sqrt.(sum(abs2, A, dims=2))

Looks like support for reductions like sum(abs2, A; dims=2) for CUSPARSE arrays was recently implemented but not yet released: CUSPARSE: Implement out-of-place reductions (#1987) · JuliaGPU/CUDA.jl@7660edc · GitHub

So you can either wait for the next release of CUDA.jl (ping @maleadt), or repurpose some of the code removed in that commit, specifically the functions highlighted here: CUSPARSE: Implement out-of-place reductions (#1987) · JuliaGPU/CUDA.jl@7660edc · GitHub

That’s the removed code; the implementation of mapreduce below should be more generic.

Just FYI, there’s a couple of minor but breaking changes incoming (such as the deprecation of Julia 1.6-1.7 and CUDA 11.0-11.3), so the next release will take a while. If you desperately need a feature in a release, we can consider backporting it to the previous stable one.

1 Like

Thanks @maleadt

I really need that functionality because CPU calculation is very lengthy.

I just added the lp/sum-dim-fix branch, I could use it like

using CUDA
using SparseArrays
using LinearAlgebra

function scale!(coefficients)
    A = CuSparseMatrixCSR(convert(SparseArrays.SparseMatrixCSC{Float64,Int}, coefficients))
    scaling_factors = 1.0 ./ sqrt.(sum(abs2, A, dims=2))
    D = Diagonal(scaling_factors)
    A = D * A
    return A

m,n = 5,6
p = 0.5
x = sprand(Float64, m, n, p)

But I am not sure; when I use the same function inside my code, I get the same error

WARNING: Error while freeing DeviceBuffer(223.113 KiB at 0x0000000302000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002c5), meta=nothing)

That error is unrelated. The exception, CUDA_ERROR_CONTEXT_IS_DESTROYED, indicates that some memory is incorrectly being freed during process teardown. If you have an MWE, please file an issue.