# Best way to use CuSparseMatrixBSR

I’ve been writing a set of kernels in which I calculate many small (mostly on the order of ~10x10) square matrices column by column, like in this toy example

``````using CUDA

# each slice L[:, :, j] is a matrix of interest
Ls = CUDA.rand(3, 3, 100)
f(x) = 3 * x - 2

function matrix_kernel(f, Ls)
ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
sz = size(Ls)
len = sz[2] * sz[3]
for i in ind : stride : len
m, n = Tuple(CartesianIndices((sz[2], sz[3]))[i])
Ls[:, m, n] = f(Ls[:, m, n])
end
end

``````

Afterwards I want to solve a linear equation for each of the matrices. I could either create a blockdiagonal matrix holding the slices of `Ls` and a long vector `rs`, or write a kernel that solves each small problem. This seemed simpler so I tried it, but left division seems to result in dynamic function invocations

``````rs = CUDA.rand(3*100)
hs = CUDA.rand(3*100)

function solver_kernel(Ls, rs, hs)
ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
N = size(Ls, 1)
len = size(Ls, 3)
for i in ind : stride : len
L = Ls[:, :, i]
j = (i-1) * N
r = rs[j+1:j+N]
hs[j+1:j+N] .= L \ r
end
end
``````

So my that leaves the other option of creating one large sparse blockdiagonal matrix. So my question is, how can I initialize a `CuSparseMatrixBSR` with the diagonal blocks as described? Its implementation seems to be different from the CUSPARSE C library. It would likely be inefficient to need to call something like

``````using SparseArrays

cu(blockdiag(sparse.(eachslice(Ls, dims=3))...))
``````

so I’d prefer to initialize the sparse matrix beforehand and manipulate the data in `matrix_kernel`.

Any help is much appreciated! Especially with niche undocumented methods like this.

Does `blockdiag` even work for CUDA.jl’s sparse matrix types? We don’t have it specialized. I also don’t have experience with it, so just some general thoughts:

Lacking an implementation of `blockdiag`, you’d have to initialize the individual components of the sparse array individually (in the case of BSR that’s rowPtr, colVal, nzVal, etc). That’s not very nice, but I wonder if it would even be useful for your case: What do you want to do with the sparse matrix afterwards? We generally don’t support native Julia operations (i.e. kernels) on sparse arrays, with the only exception being broadcast, so you’ll only be able to rely on what CUSPARSE has to offer.

CuDeviceArray (the device counterpart of the CuArray objects you’re using) doesn’t support many vectorized operations, like ldiv! here. That’s because it’s typically a performance trap (to use the GPU efficiently you want to parallelize across the operation, and not just perform a heavyweight operation on few threads), but in your case it would make sense of course. Given that you’re dealing with small arrays, you can always try loading the data into a StaticArray and use that to perform the operations you need.

No you’re right, using `blockdiag` doesn’t work that way, it was just what I would want to do.

I only really need it to calculate the vectors `hs`. Aferwards it will be `unsafe_free!`d. I’ve worked with the internals of sparse matrices before for another part of this project, so I’m not afraid to use though it’ll probably be a bit ugly.

I’ve thought about that, though it’s not exclusive to small arrays. Eg. one usecase involves solving a PDE using Ritz-Galerkin discretization, so the dimension will be on the order of thousands. At that point, does `StaticArrays` not simply default to the `AbstractArray` methods?

Apparently some methods behave like that (TIL), but I don’t think it’s done in general, so it’s something to watch out for (or you risk generating huge tuples consuming lots of stack space, and lots of code being generated because of unrolled loops).

I suppose that leaves a large sparse blockdiagonal matrix. Almost everything about CUDA.jl’s `CuSparseMatrixBSR` seem pretty isomorphic to the CUSPARSE library, but the `dir` attribute is unclear to me.

``````mutable struct CuSparseMatrixBSR{Tv, Ti} <: AbstractCuSparseMatrix{Tv, Ti}
rowPtr::CuVector{Ti}
colVal::CuVector{Ti}
nzVal::CuVector{Tv}
dims::NTuple{2,Int}
blockDim::Ti
dir::SparseChar
nnzb::Ti
# standard inner constructor here
end
``````

I can’t find any reference to `dir` in other functions either, so I can’t work backwards to derive what this attribute should be set to.

So after reading through the CUSPARSE docs, specifically the level-2 function bsrmv, I can gleam that the `dir` attribute refers to the matrix descriptor `cusparseDirection_t` here.

Further, conversion from `CuSparseMatrixCSR` to `CuSparseMatrixBSR` seems to set `dir` as `'R'`, consistent with the fact that CSR formatted matrices are parsed row major.

Since julia stores matrices in column major format, it would seem that the appropriate thing to do is to set `dir = 'C'`, assuming that this is the alias for `CUSPARSE_DIRECTION_COLUMN`.

From this I believe the appropriate construction of the blockdiagonal matrix `BLs` should be pretty simple

``````nzval = CUDA.rand(3*3*100)
# == reshape(Ls, 3*3*100)

rowptr = CuArray{Int32,1}(1:101)
colval = CuArray{Int32,1}(1:100)
dims = (3*100, 3*100)
blockdim = 3
dir = 'C'
nnz = 100

BLs = CuSparseMatrixBSR{Float32,Int32}(
rowptr, colval, nzval, dims, blockdim, dir, nnz
)
``````

This would make indexing into the individual `Ls` slightly uglier, the first kernel becomes

``````function matrix_kernel(f, BLsnzVal, BLsblockDim)
ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
len = length(BLsnzVal) ÷ BLsblockDim
for i in ind-1 : stride : len-1
j = 3 * i + 1 : 3 * i + BLsblockDim
BLsnzval[j] = f(BLsnzVal[j])
end
end

I believe this makes sense, but I’d still love to verify that `dir = 'C'` is appropriate here. I’m sorry I have to ask, I won’t be at my CUDA-capable work laptop until Monday.
The conversion from Chars is modeled to be similar to Julia’s use of LAPACK/BLAS, with e.g. `'T'` and `'N'` args for transposed/normal.