Best way to use CuSparseMatrixBSR

I’ve been writing a set of kernels in which I calculate many small (mostly on the order of ~10x10) square matrices column by column, like in this toy example:

using CUDA

# each slice L[:, :, j] is a matrix of interest
Ls = CUDA.rand(3, 3, 100)
f(x) = 3 * x - 2

function matrix_kernel(f, Ls)
    ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    sz = size(Ls)
    len = sz[2] * sz[3]
    for i in ind : stride : len
        m, n = Tuple(CartesianIndices((sz[2], sz[3]))[i])
        # apply f elementwise to the column; @views avoids allocating a slice
        @views Ls[:, m, n] .= f.(Ls[:, m, n])
    end
    return nothing
end

@cuda threads=3*100 matrix_kernel(f, Ls)

Afterwards I want to solve a linear equation for each of the matrices. I could either create a block-diagonal matrix holding the slices of Ls and a long vector rs, or write a kernel that solves each small problem. The latter seemed simpler, so I tried it, but left division seems to result in dynamic function invocations:

rs = CUDA.rand(3*100)
hs = CUDA.rand(3*100)

function solver_kernel(Ls, rs, hs)
    ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    N = size(Ls, 1)
    len = size(Ls, 3)
    for i in ind : stride : len
        L = Ls[:, :, i]
        j = (i-1) * N
        r = rs[j+1:j+N]
        hs[j+1:j+N] .= L \ r
    end
end

So that leaves the other option of creating one large sparse block-diagonal matrix. My question is: how can I initialize a CuSparseMatrixBSR with the diagonal blocks as described? Its implementation seems to differ from the CUSPARSE C library. It would likely be inefficient to call something like

using SparseArrays

cu(blockdiag(sparse.(eachslice(Ls, dims=3))...))

so I’d prefer to initialize the sparse matrix beforehand and manipulate the data in matrix_kernel.

Any help is much appreciated! Especially with niche undocumented methods like this.

Does blockdiag even work for CUDA.jl’s sparse matrix types? We don’t have it specialized. I also don’t have experience with it, so just some general thoughts:

Lacking an implementation of blockdiag, you’d have to initialize the individual components of the sparse array yourself (in the case of BSR that’s rowPtr, colVal, nzVal, etc.). That’s not very nice, but I wonder whether it would even be useful in your case: what do you want to do with the sparse matrix afterwards? We generally don’t support native Julia operations (i.e. kernels) on sparse arrays, with the only exception being broadcast, so you’ll only be able to rely on what CUSPARSE has to offer.

CuDeviceArray (the device counterpart of the CuArray objects you’re using) doesn’t support many vectorized operations, like the ldiv! here. That’s because it’s typically a performance trap (to use the GPU efficiently you want to parallelize across the operation, not just perform a heavyweight operation on a few threads), but in your case it would of course make sense. Given that you’re dealing with small arrays, you can always try loading the data into a StaticArray and using that to perform the operations you need.
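
For example, something along these lines might work (untested; it hard-codes the 3×3 block size from your toy example and uses a hypothetical kernel name):

using CUDA, StaticArrays

function static_solver_kernel(Ls, rs, hs)
    ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    len = size(Ls, 3)
    for i in ind : stride : len
        # load the i-th 3×3 slice into a stack-allocated SMatrix (column major)
        L = SMatrix{3,3}(Ls[1,1,i], Ls[2,1,i], Ls[3,1,i],
                         Ls[1,2,i], Ls[2,2,i], Ls[3,2,i],
                         Ls[1,3,i], Ls[2,3,i], Ls[3,3,i])
        j = (i - 1) * 3
        r = SVector{3}(rs[j+1], rs[j+2], rs[j+3])
        h = L \ r            # small static solve, fully unrolled
        for k in 1:3
            hs[j+k] = h[k]
        end
    end
    return nothing
end

@cuda threads=100 static_solver_kernel(Ls, rs, hs)

For such a small static matrix the \ compiles to straight-line code, so there’s no dynamic dispatch or heap allocation involved.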

No, you’re right, using blockdiag doesn’t work that way; it was just what I would want to do.

I only really need it to calculate the vectors hs. Afterwards it will be unsafe_free!d. I’ve worked with the internals of sparse matrices before for another part of this project, so I’m not afraid to use them, though it’ll probably be a bit ugly.

I’ve thought about that, though it’s not exclusive to small arrays. E.g. one use case involves solving a PDE using a Ritz–Galerkin discretization, so the dimension will be on the order of thousands. At that point, does StaticArrays not simply default to the AbstractArray methods?

Apparently some methods behave like that (TIL), but I don’t think it’s done in general, so it’s something to watch out for (or you risk generating huge tuples consuming lots of stack space, and lots of code being generated because of unrolled loops).

I suppose that leaves a large sparse block-diagonal matrix. Almost everything about CUDA.jl’s CuSparseMatrixBSR seems pretty isomorphic to the CUSPARSE library, but the dir attribute is unclear to me.

mutable struct CuSparseMatrixBSR{Tv, Ti} <: AbstractCuSparseMatrix{Tv, Ti}
    rowPtr::CuVector{Ti}
    colVal::CuVector{Ti}
    nzVal::CuVector{Tv}
    dims::NTuple{2,Int}
    blockDim::Ti
    dir::SparseChar
    nnzb::Ti
    # standard inner constructor here
end

I can’t find any reference to dir in other functions either, so I can’t work backwards to derive what this attribute should be set to.

So after reading through the CUSPARSE docs, specifically the level-2 function bsrmv, I can glean that the dir attribute refers to the matrix descriptor cusparseDirection_t here.

Further, the conversion from CuSparseMatrixCSR to CuSparseMatrixBSR seems to set dir to 'R', consistent with the fact that CSR-formatted matrices are stored row major.

Since Julia stores matrices in column-major format, it would seem that the appropriate thing to do is to set dir = 'C', assuming that this is the alias for CUSPARSE_DIRECTION_COLUMN.

From this I believe the appropriate construction of the block-diagonal matrix BLs should be pretty simple:

using CUDA.CUSPARSE

# block data: 100 contiguous 3×3 blocks, each stored column major
nzval = CUDA.rand(3*3*100)
# == reshape(Ls, 3*3*100)

rowptr = CuArray{Int32,1}(1:101)   # one block per block-row
colval = CuArray{Int32,1}(1:100)   # block j sits in block-column j
dims = (3*100, 3*100)
blockdim = 3
dir = 'C'
nnzb = 100                         # number of nonzero blocks

BLs = CuSparseMatrixBSR{Float32,Int32}(
    rowptr, colval, nzval, dims, blockdim, dir, nnzb
)

This would make indexing into the individual Ls slightly uglier; the first kernel becomes:

function matrix_kernel(f, BLsnzVal, BLsblockDim)
    ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    len = length(BLsnzVal) ÷ BLsblockDim
    for i in ind-1 : stride : len-1
        # each block column occupies BLsblockDim consecutive entries of nzVal
        j = BLsblockDim * i + 1 : BLsblockDim * (i + 1)
        @views BLsnzVal[j] .= f.(BLsnzVal[j])
    end
    return nothing
end

@cuda threads=3*100 matrix_kernel(f, BLs.nzVal, BLs.blockDim)
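
Alternatively, since nzVal is just reshape(Ls, 3*3*100), I suppose I could skip the separate nzval allocation, build BLs directly on top of the existing Ls buffer, and keep the original kernel unchanged (untested; I’m not sure the constructor accepts a reshaped CuArray as nzVal):

# untested: alias the existing Ls memory as the BSR block storage (no copy)
nzval = reshape(Ls, 3*3*100)
BLs = CuSparseMatrixBSR{Float32,Int32}(
    rowptr, colval, nzval, dims, blockdim, dir, nnzb
)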

I believe this makes sense, but I’d still love to verify that dir = 'C' is appropriate here. I’m sorry I have to ask; I won’t be at my CUDA-capable work laptop until Monday.
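
In the meantime, here’s the sanity check I plan to run: compare a matrix-vector product of BLs against a CPU block-diagonal reference built with SparseArrays.blockdiag (this assumes * on a CuSparseMatrixBSR is hooked up to CUSPARSE’s bsrmv):

using SparseArrays

# interpret nzVal as 100 column-major 3×3 blocks, which is what dir = 'C' should mean
blocks = reshape(Array(BLs.nzVal), 3, 3, 100)
ref = blockdiag([sparse(blocks[:, :, j]) for j in 1:100]...)

x = CUDA.rand(Float32, 3*100)
Array(BLs * x) ≈ ref * Array(x)   # should be true if dir = 'C' is correct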

That seems correct: CUDA.jl/types.jl at dbc767b1b75f4acb7c40c23178ef79940156551c · JuliaGPU/CUDA.jl · GitHub
The conversion from Chars is modeled after Julia’s use of LAPACK/BLAS, with e.g. 'T' and 'N' arguments for transposed/normal.

Ok, thank you! And thanks also for the great work on CUDA.jl; I’ve used it in my thesis to get a >100x speedup on this project :smile:
