I’m trying to find a good way to compute V^TAV where A is a sparse CSC matrix A and V = I \otimes B is a block diagonal matrix with repeating dense blocks (size of B is around \mathbb{R}^{20 \times 10}).

Looping through blocks A_i and computing B^TA_iB seems about 2x slower than just storing V as a sparse matrix, but I got about 2x speedup using dense matrix operations by popping A into a dense array \tilde{A}

and BLAS-3 multiplying \tilde{A} by B^T, B using

`reshape`

and `permutedims`

.
This seems a little awkward to me, and I’m wondering if there are nicer “Julian” solutions that I’ve overlooked.