I’ve been writing a set of kernels in which I calculate many small (mostly on the order of ~10x10) square matrices column by column, like in this toy example

```
using CUDA
# each slice L[:, :, j] is a matrix of interest
Ls = CUDA.rand(3, 3, 100)
f(x) = 3 * x - 2
function matrix_kernel(f, Ls)
ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
sz = size(Ls)
len = sz[2] * sz[3]
for i in ind : stride : len
m, n = Tuple(CartesianIndices((sz[2], sz[3]))[i])
Ls[:, m, n] = f(Ls[:, m, n])
end
end
@cuda threads=3*100 matrix_kernel(f, Ls)
```

Afterwards I want to solve a linear equation for each of the matrices. I could either create a blockdiagonal matrix holding the slices of `Ls`

and a long vector `rs`

, or write a kernel that solves each small problem. This seemed simpler so I tried it, but left division seems to result in dynamic function invocations

```
rs = CUDA.rand(3*100)
hs = CUDA.rand(3*100)
function solver_kernel(Ls, rs, hs)
ind = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
N = size(Ls, 1)
len = size(Ls, 3)
for i in ind : stride : len
L = Ls[:, :, i]
j = (i-1) * N
r = rs[j+1:j+N]
hs[j+1:j+N] .= L \ r
end
end
```

So my that leaves the other option of creating one large sparse blockdiagonal matrix. So my question is, how can I initialize a `CuSparseMatrixBSR`

with the diagonal blocks as described? Its implementation seems to be different from the CUSPARSE C library. It would likely be inefficient to need to call something like

```
using SparseArrays
cu(blockdiag(sparse.(eachslice(Ls, dims=3))...))
```

so I’d prefer to initialize the sparse matrix beforehand and manipulate the data in `matrix_kernel`

.

Any help is much appreciated! Especially with niche undocumented methods like this.