Best way to use CuSparseMatrixBSR

I’ve been writing a set of kernels in which I calculate many small (mostly on the order of ~10x10) square matrices column by column, like in this toy example:

using CUDA

# each slice L[:, :, j] is a matrix of interest
Ls = CUDA.rand(3, 3, 100)
f(x) = 3 * x - 2

function matrix_kernel(f, Ls)
    ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    sz = size(Ls)
    len = sz[2] * sz[3]
    for i in ind : stride : len
        m, n = Tuple(CartesianIndices((sz[2], sz[3]))[i])
        # apply f elementwise to the column; @views avoids allocating a slice
        @views Ls[:, m, n] .= f.(Ls[:, m, n])
    end
    return nothing
end

@cuda threads=3*100 matrix_kernel(f, Ls)

Afterwards I want to solve a linear equation for each of the matrices. I could either create a block-diagonal matrix holding the slices of Ls and a long vector rs, or write a kernel that solves each small problem. The latter seemed simpler, so I tried it, but left division seems to result in dynamic function invocations:

rs = CUDA.rand(3*100)
hs = CUDA.rand(3*100)

function solver_kernel(Ls, rs, hs)
    ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    N = size(Ls, 1)
    len = size(Ls, 3)
    for i in ind : stride : len
        L = Ls[:, :, i]
        j = (i-1) * N
        r = rs[j+1:j+N]
        hs[j+1:j+N] .= L \ r
    end
end

So that leaves the other option of creating one large sparse block-diagonal matrix. My question is: how can I initialize a CuSparseMatrixBSR with the diagonal blocks as described? Its implementation seems to differ from the CUSPARSE C library. It would likely be inefficient to call something like

using SparseArrays

cu(blockdiag(sparse.(eachslice(Ls, dims=3))...))

so I’d prefer to initialize the sparse matrix beforehand and manipulate the data in matrix_kernel.

Any help is much appreciated! Especially with niche undocumented methods like this.

Does blockdiag even work for CUDA.jl’s sparse matrix types? We don’t have it specialized. I also don’t have experience with it, so just some general thoughts:

Lacking an implementation of blockdiag, you’d have to initialize the individual components of the sparse array yourself (in the case of BSR that’s rowPtr, colVal, nzVal, etc.). That’s not very nice, but I wonder whether it would even be useful in your case: what do you want to do with the sparse matrix afterwards? We generally don’t support native Julia operations (i.e. kernels) on sparse arrays, with the only exception being broadcast, so you’ll only be able to rely on what CUSPARSE has to offer.

CuDeviceArray (the device counterpart of the CuArray objects you’re using) doesn’t support many vectorized operations, like the ldiv! here. That’s because it’s typically a performance trap (to use the GPU efficiently you want to parallelize across the operation, not just perform a heavyweight operation on a few threads), but in your case it would of course make sense. Given that you’re dealing with small arrays, you can always try loading the data into a StaticArray and using that to perform the operations you need.
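
For example, something along these lines might work (untested; it hard-codes the 3×3 block size from your toy example and uses a hypothetical kernel name):

using CUDA, StaticArrays

function static_solver_kernel(Ls, rs, hs)
    ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    len = size(Ls, 3)
    for i in ind : stride : len
        # load the i-th 3×3 slice into a stack-allocated SMatrix (column major)
        L = SMatrix{3,3}(Ls[1,1,i], Ls[2,1,i], Ls[3,1,i],
                         Ls[1,2,i], Ls[2,2,i], Ls[3,2,i],
                         Ls[1,3,i], Ls[2,3,i], Ls[3,3,i])
        j = (i - 1) * 3
        r = SVector{3}(rs[j+1], rs[j+2], rs[j+3])
        h = L \ r            # small static solve, fully unrolled
        for k in 1:3
            hs[j+k] = h[k]
        end
    end
    return nothing
end

@cuda threads=100 static_solver_kernel(Ls, rs, hs)

For such a small static matrix the \ compiles to straight-line code, so there’s no dynamic dispatch or heap allocation involved.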

No, you’re right, using blockdiag doesn’t work that way; it was just what I would want to do.

I only really need it to calculate the vectors hs. Afterwards it will be unsafe_free!d. I’ve worked with the internals of sparse matrices before for another part of this project, so I’m not afraid to use them, though it’ll probably be a bit ugly.

I’ve thought about that, though it’s not exclusive to small arrays. E.g. one use case involves solving a PDE using a Ritz–Galerkin discretization, so the dimension will be on the order of thousands. At that point, does StaticArrays not simply default to the AbstractArray methods?

Apparently some methods behave like that (TIL), but I don’t think it’s done in general, so it’s something to watch out for (or you risk generating huge tuples consuming lots of stack space, and lots of code being generated because of unrolled loops).

I suppose that leaves a large sparse block-diagonal matrix. Almost everything about CUDA.jl’s CuSparseMatrixBSR seems pretty isomorphic to the CUSPARSE library, but the dir attribute is unclear to me.

mutable struct CuSparseMatrixBSR{Tv, Ti} <: AbstractCuSparseMatrix{Tv, Ti}
    rowPtr::CuVector{Ti}
    colVal::CuVector{Ti}
    nzVal::CuVector{Tv}
    dims::NTuple{2,Int}
    blockDim::Ti
    dir::SparseChar
    nnzb::Ti
    # standard inner constructor here
end

I can’t find any reference to dir in other functions either, so I can’t work backwards to derive what this attribute should be set to.

So after reading through the CUSPARSE docs, specifically the level-2 function bsrmv, I can glean that the dir attribute refers to the matrix descriptor cusparseDirection_t here.

Further, the conversion from CuSparseMatrixCSR to CuSparseMatrixBSR seems to set dir to 'R', consistent with the fact that CSR-formatted matrices are stored row major.

Since Julia stores matrices in column-major format, it would seem that the appropriate thing to do is to set dir = 'C', assuming that this is the alias for CUSPARSE_DIRECTION_COLUMN.

From this I believe the appropriate construction of the block-diagonal matrix BLs should be pretty simple:

using CUDA.CUSPARSE

# block data: 100 contiguous 3×3 blocks, each stored column major
nzval = CUDA.rand(3*3*100)
# == reshape(Ls, 3*3*100)

rowptr = CuArray{Int32,1}(1:101)   # one block per block-row
colval = CuArray{Int32,1}(1:100)   # block j sits in block-column j
dims = (3*100, 3*100)
blockdim = 3
dir = 'C'
nnzb = 100                         # number of nonzero blocks

BLs = CuSparseMatrixBSR{Float32,Int32}(
    rowptr, colval, nzval, dims, blockdim, dir, nnzb
)

This would make indexing into the individual Ls slightly uglier; the first kernel becomes:

function matrix_kernel(f, BLsnzVal, BLsblockDim)
    ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    len = length(BLsnzVal) ÷ BLsblockDim
    for i in ind-1 : stride : len-1
        # each block column occupies BLsblockDim consecutive entries of nzVal
        j = BLsblockDim * i + 1 : BLsblockDim * (i + 1)
        @views BLsnzVal[j] .= f.(BLsnzVal[j])
    end
    return nothing
end

@cuda threads=3*100 matrix_kernel(f, BLs.nzVal, BLs.blockDim)
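
Alternatively, since nzVal is just reshape(Ls, 3*3*100), I suppose I could skip the separate nzval allocation, build BLs directly on top of the existing Ls buffer, and keep the original kernel unchanged (untested; I’m not sure the constructor accepts a reshaped CuArray as nzVal):

# untested: alias the existing Ls memory as the BSR block storage (no copy)
nzval = reshape(Ls, 3*3*100)
BLs = CuSparseMatrixBSR{Float32,Int32}(
    rowptr, colval, nzval, dims, blockdim, dir, nnzb
)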

I believe this makes sense, but I’d still love to verify that dir = 'C' is appropriate here. I’m sorry I have to ask; I won’t be at my CUDA-capable work laptop until Monday.
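
In the meantime, here’s the sanity check I plan to run: compare a matrix-vector product of BLs against a CPU block-diagonal reference built with SparseArrays.blockdiag (this assumes * on a CuSparseMatrixBSR is hooked up to CUSPARSE’s bsrmv):

using SparseArrays

# interpret nzVal as 100 column-major 3×3 blocks, which is what dir = 'C' should mean
blocks = reshape(Array(BLs.nzVal), 3, 3, 100)
ref = blockdiag([sparse(blocks[:, :, j]) for j in 1:100]...)

x = CUDA.rand(Float32, 3*100)
Array(BLs * x) ≈ ref * Array(x)   # should be true if dir = 'C' is correct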

That seems correct: CUDA.jl/types.jl at dbc767b1b75f4acb7c40c23178ef79940156551c · JuliaGPU/CUDA.jl · GitHub
The conversion from Chars is modeled after Julia’s use of LAPACK/BLAS, with e.g. 'T' and 'N' arguments for transposed/normal.

Ok, thank you! And thanks also for the great work on CUDA.jl; I’ve used it in my thesis to get a >100x speedup on this project :smile:
