I have this sparse matrix operation R*kron(I,C)*R' where R is a sparse matrix, I is the identity matrix, and C is a dense square matrix. The size of these matrices can be quite large; R is 2000 x 6250000 and the kronecker product is 6250000 x 6250000 but fortunately only the C 2500 x 2500 matrix needs to be stored. The product will only be a 2000 x 2000 square matrix.

Because of the size of these matrices it is very slow to calculate the product not to mention it a naive implementation would use a ton of memory. I managed to take advantage of the structure of the kronecker product to turn the product into the sum of the products of each diagonal block i.e. sum.(R*kron(Diagonal(circshift([1,0,...],i)),C)*R' for i=1:nblocks). This is faster, parallelizable, and memory efficient but it is still prohibitively expensive to calculate.

For my use case I have to calculate that product many times. However, only the values of the C matrix are changing, while the rest of the matrices are constant. Since R is sparse each index of the final product should only consist of relatively few elements of C. In theory it should be faster to calculate the product symbolically once and reuse it. My first impulse was to use SymEngine to make a C matrix filled with symbols but its taking too long to calculate.

Does anyone have any alternative approach that will speed up this calculation?

I would suggest trying to split R into 2000x2500 submatrices Ri, then it should be the sum of Ri * C * Ri'. This is readily parallelisable, and there may be further optimisations available depending on the structures of Ri and C.

Actually doing what simon suggested is actually slower than my original version (below)

function transform_eprz_cov(C,R)
nr,nc = size(C.data)
colptr = vcat(1:nr:(nr*nc),fill(nr*nc+1,nr*C.n + 1 - nc))
colptr2 = Array{Int}(undef,length(colptr))
rowval = repeat(1:nr,nc)
rowval2 = Array{Int}(undef,length(rowval))
K = SparseMatrixCSC(size(C)..., colptr, rowval, vec(C.data))
S = @distributed (+) for i=1:C.n
circshift!(colptr2,colptr,(i-1)*nc)
rowval2 .= rowval .+ (i-1)*nr
K = SparseMatrixCSC(K.m,K.n,colptr2,rowval2, K.nzval)
Array((R*(R*K)'))
end
return S
end

The best I could do using Simons suggested is

function f(C,R)
S = zeros(2000,2000)
@showprogress for i=1:2500:2500*2500
Ris = R[:,i:(i+2500-1)]
nnz(Ris) == 0 && continue
Ri = Array(Ris)
S .= S .+ Ri*(Ri*C.data)'
end
S
end

The way I wrote the it in the message isnâ€™t how I implemented it. C is symmetric so what I did was R*(R*C)' and R is actually super sparse (sparsity = 0.0003).

Apart from everything @bennedich said, you are measuring the time for splitting the R into Ri.

Simonâ€™s point is that R should never exist, you should split during assembly. Also, you should not materialize the adjoint, you should probably assemble RT directly (ie R exists in your mind, the computer only ever sees the list of adjoints RiT and you compute the sum over RiT' * C * RiT).

Apparently lazy adjoint of sparse * dense is fast, and dense times sparse is fast as well, so thatâ€™s how your data wants to be; splitting into blocks is obviously right, so thatâ€™s how your data wants to be.

When considering different C, you of course reuse your RiTs::Vector{SparseMatrixCSC{Float64,Int64}} instead of re-assembling.

I canâ€™t say what the correct @threads is. BLAS can use threads as well, and in order to make optimal use of L3 you want to have as few different matrices running at the same time, while using as many cores as you have.

Turns out I can get better performance by making C sparse. It now takes a 1:30 min. I tried materializing the transpose and it didnâ€™t make a difference when C is sparse.

function f(C,R)
S = zeros(2000,2000)
Cs = sparse(C.data)
for i=1:2500:2500*2500
Ri = R[:,i:(i+2500-1)]
nnz(Ri) == 0 && continue
S .= S .+ Ri*(Ri*Cs)'
end
S
end

I would still like better performance. A custom compute kernel should be faster right? Is there a guide for using Taco from Julia anywhere.

function assemble(R)
RiTs = Vector{SparseMatrixCSC{Float64,Int64}}()
for i=1:2500:2500*2500
Ri = R[:,i:(i+2500-1)]
nnz(Ri) == 0 & continue
push!(RiTs, copy(Ri'))
end
RiTs
end
function f(C, RiTs)
S = zeros(2000,2000)
#C = sparse(C.data)
for RiT in RiTs
S .= S .+ RiT' * C * RiT
end
S
end

and both timed separately. If assemble turns out to be the expensive one, then there are ways around it.

I did try it but it doesnâ€™t make a noticable difference. The calculation is dominated by the matrix multiplication. Also as written the example you gave is very fast due to the fact the assemble routine returns an empty array (I was very confused for a little while). You needed to use && instead of & Additionally, I still see a large speed up if I make C sparse, 100s vs 165s.

Is C actually sparse (low nnz)? Or is the point rather that RiT has many zero columns? (the â€śtypicalâ€ť sparse matrix has a low but positive number of nonzero entries in every row and column)

In that case, you might be able to optimize further. This is because you now know the sparsity patterns you want: You only need C[ind_nx, ind_nz] where ind_nz is the collection of nonzero column indices of RiT.

This is because sparse matrices are much slower than dense matrices if they have almost full nnz; and your problem is that C*RiT is kinda sparse (zero columns) but dense in all nonzero columns. By selecting the nonzero columns by hand, you get the advantages of both sparsity and fast sparse*dense operations, with the bonus that you never compute unneeded rows of C*RiT.

Itâ€™s a neat idea shame it doesnâ€™t work for my case. The matrix R doesnâ€™t have any non-zero rows, therefore R' doesnâ€™t have any non-zero columns. However, about 62% of the columns of R are non-zero (nzc). So I tried to doing Ri_nzc*(Ri_nzc*C[nzc,nzc])' (Ri_nzc is precalculated, but I couldnâ€™t do the same for C[nzc,nzc] due to memory constraints) but the it was way slower; it never actually completed. I think the slicing of the C matrix was causing the massive slow down. I also tried using views of C but that didnâ€™t work as well.