# Reusing sparsity patterns in sparse matrix products

I have the sparse matrix operation `R*kron(I,C)*R'`, where `R` is a sparse matrix, `I` is the identity matrix, and `C` is a dense square matrix. These matrices can be quite large: `R` is `2000 x 6250000` and the Kronecker product is `6250000 x 6250000`, but fortunately only the `2500 x 2500` matrix `C` needs to be stored. The product is only a `2000 x 2000` square matrix.

Because of the size of these matrices, it is very slow to calculate the product, not to mention that a naive implementation would use a ton of memory. I managed to take advantage of the structure of the Kronecker product to turn the product into the sum of the products of each diagonal block, i.e. `sum(R*kron(Diagonal(circshift([1,0,...],i)),C)*R' for i=1:nblocks)`. This is faster, parallelizable, and memory-efficient, but it is still prohibitively expensive to calculate.

For my use case I have to calculate that product many times. However, only the values of the `C` matrix change; the rest of the matrices are constant. Since `R` is sparse, each entry of the final product should depend on relatively few elements of `C`. In theory it should be faster to calculate the product symbolically once and reuse it. My first impulse was to use SymEngine to make a `C` matrix filled with symbols, but it's taking too long to calculate.

Does anyone have any alternative approach that will speed up this calculation?

I would suggest trying to split `R` into `2000 x 2500` submatrices `Ri`; then the product should be the sum of `Ri * C * Ri'`. This is readily parallelisable, and there may be further optimisations available depending on the structures of `Ri` and `C`.
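That block identity is easy to sanity-check at toy sizes (the dimensions below are illustrative, not the ones from the question):

```julia
using LinearAlgebra, SparseArrays

# Toy-sized check that R*kron(I, C)*R' equals the sum of per-block products
# Ri*C*Ri', where Ri is the i-th column block of R. Sizes are illustrative.
n, b, k = 4, 3, 5                       # rows of R, block size, block count
R = sprand(n, b*k, 0.3)
C = rand(b, b)

full    = R * kron(I(k), C) * R'
blocked = sum(R[:, (i-1)*b+1:i*b] * C * R[:, (i-1)*b+1:i*b]' for i in 1:k)

@assert full ≈ blocked
```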


Do everything Simon says. In addition, respect the layout. Post warm-up:

```julia
julia> C = rand(500, 500);

julia> R = sprand(400, 500, 0.1);

julia> RT = copy(R'); # evals the lazy adjoint

julia> begin
       @time R*C*R';
       @time RT'*C*RT;
       @time R*C*RT;
       @time R*(C*R');
       @time R*(C*RT);
       @time RT'*(C*RT);
       end;
  3.311467 seconds (27 allocations: 6.440 MiB)
  0.031931 seconds (10 allocations: 2.747 MiB)
  0.033586 seconds (8 allocations: 2.747 MiB)
  4.347397 seconds (37 allocations: 7.375 MiB)
  0.052198 seconds (8 allocations: 2.747 MiB, 33.43% gc time)
  0.033085 seconds (10 allocations: 2.747 MiB)
```

The way you wrote it is almost the worst possible way. There is no pattern to reuse, and sparse*dense gives dense (unless you are absurdly sparse).

Actually doing what Simon suggested is slower than my original version (below).

```julia
using Distributed, SparseArrays  # @distributed comes from Distributed

function transform_eprz_cov(C, R)
    nr, nc = size(C.data)
    colptr = vcat(1:nr:(nr*nc), fill(nr*nc + 1, nr*C.n + 1 - nc))
    colptr2 = Array{Int}(undef, length(colptr))
    rowval = repeat(1:nr, nc)
    rowval2 = Array{Int}(undef, length(rowval))
    K = SparseMatrixCSC(size(C)..., colptr, rowval, vec(C.data))
    S = @distributed (+) for i = 1:C.n
        circshift!(colptr2, colptr, (i-1)*nc)
        rowval2 .= rowval .+ (i-1)*nr
        K = SparseMatrixCSC(K.m, K.n, colptr2, rowval2, K.nzval)
        Array((R*(R*K)'))
    end
    return S
end
```

The best I could do using Simon's suggestion is

```julia
using SparseArrays, ProgressMeter  # @showprogress comes from ProgressMeter

function f(C, R)
    S = zeros(2000, 2000)
    @showprogress for i = 1:2500:2500*2500
        Ris = R[:, i:(i+2500-1)]
        nnz(Ris) == 0 && continue
        Ri = Array(Ris)
        S .= S .+ Ri*(Ri*C.data)'
    end
    S
end
```

which took about 11 minutes compared to my original 4 min.

The way I wrote it in the message isn't how I implemented it. `C` is symmetric, so what I did was `R*(R*C)'`, and `R` is actually super sparse (sparsity = 0.0003).
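For what it's worth, the symmetric shortcut is easy to verify on a toy example (sizes illustrative):

```julia
using LinearAlgebra, SparseArrays

# Toy check of the symmetric shortcut: for symmetric C, (R*C)' == C*R',
# so R*(R*C)' == R*C*R'. Sizes are illustrative.
R = sprand(40, 50, 0.1)
A = rand(50, 50)
C = (A + A') / 2          # make C symmetric

@assert R * C * R' ≈ R * (R * C)'
```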

You can still gain from materializing the adjoint:

```julia
julia> C = rand(500, 500);

julia> R = sprand(400, 500, 0.1);

julia> @time R*(R*C)';
  0.125526 seconds (15 allocations: 2.747 MiB)

julia> @time R*copy((R*C)');
  0.023704 seconds (11 allocations: 4.273 MiB)
```

Apart from everything @bennedich said, you are also measuring the time for splitting `R` into the `Ri`.

Simon's point is that `R` should never exist; you should split during assembly. Also, you should not materialize the adjoint: you should probably assemble `RT` directly (i.e., `R` exists in your mind, the computer only ever sees the list of adjoints `RiT`, and you compute the sum over `RiT' * C * RiT`).

Apparently lazy adjoint of sparse * dense is fast, and dense times sparse is fast as well, so that's how your data wants to be; splitting into blocks is obviously right, so that's how your data wants to be.

When considering different `C`, you of course reuse your `RiTs::Vector{SparseMatrixCSC{Float64,Int64}}` instead of re-assembling.
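A minimal sketch of that reuse pattern, assuming a block size `b` is passed in (the function and variable names are illustrative, not from the thread):

```julia
using LinearAlgebra, SparseArrays

# Assemble the adjoint blocks once; reuse them for every new C.
# `assemble_adjoints` and `apply_blocks` are illustrative names.
function assemble_adjoints(R::SparseMatrixCSC, b::Int)
    RiTs = SparseMatrixCSC{Float64,Int}[]
    for i in 1:b:size(R, 2)
        Ri = R[:, i:i+b-1]
        nnz(Ri) == 0 && continue     # drop all-zero blocks for good
        push!(RiTs, copy(Ri'))       # materialize the adjoint once
    end
    RiTs
end

# Each call with a fresh C reuses the stored blocks unchanged.
apply_blocks(C, RiTs) = sum(RiT' * C * RiT for RiT in RiTs)
```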

I can't say what the correct `@threads` is. BLAS can use threads as well, and to make optimal use of L3 you want as few different matrices running at the same time as possible, while using as many cores as you have.
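One possible division of labor, assuming Julia-level threads over blocks with BLAS pinned to one thread (a sketch, not a tuned or benchmarked solution):

```julia
using LinearAlgebra, SparseArrays, Base.Threads

# Sketch of one threading layout (an assumption, not a benchmark): one task
# per block with single-threaded BLAS, so the two levels of parallelism do
# not oversubscribe the cores. Each task writes its own slot, so no locks.
function sum_blocks_threaded(C, RiTs)
    BLAS.set_num_threads(1)
    results = Vector{Matrix{Float64}}(undef, length(RiTs))
    @threads for i in eachindex(RiTs)
        results[i] = RiTs[i]' * C * RiTs[i]
    end
    sum(results)
end
```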


Turns out I can get better performance by making `C` sparse. It now takes 1:30 min. I tried materializing the transpose and it didn't make a difference when `C` is sparse.

```julia
using SparseArrays

function f(C, R)
    S = zeros(2000, 2000)
    Cs = sparse(C.data)
    for i = 1:2500:2500*2500
        Ri = R[:, i:(i+2500-1)]
        nnz(Ri) == 0 && continue
        S .= S .+ Ri*(Ri*Cs)'
    end
    S
end
```

I would still like better performance. A custom compute kernel should be faster, right? Is there a guide for using Taco from Julia anywhere?

Have you tried what we said? I.e.

```julia
using SparseArrays

function assemble(R)
    RiTs = Vector{SparseMatrixCSC{Float64,Int64}}()
    for i = 1:2500:2500*2500
        Ri = R[:, i:(i+2500-1)]
        nnz(Ri) == 0 & continue
        push!(RiTs, copy(Ri'))
    end
    RiTs
end

function f(C, RiTs)
    S = zeros(2000, 2000)
    #C = sparse(C.data)
    for RiT in RiTs
        S .= S .+ RiT' * C * RiT
    end
    S
end
```

and both timed separately. If `assemble` turns out to be the expensive one, then there are ways around it.

I did try it, but it doesn't make a noticeable difference. The calculation is dominated by the matrix multiplication. Also, as written, the example you gave is very fast because the assemble routine returns an empty array (I was very confused for a little while); you needed to use `&&` instead of `&`. Additionally, I still see a large speedup if I make `C` sparse: 100 s vs 165 s.

Oops.

Okay. Then I don't know what to do better.

Is `C` actually sparse (low `nnz`)? Or is the point rather that `RiT` has many zero columns? (The "typical" sparse matrix has a low but positive number of nonzero entries in every row and column.)

In that case, you might be able to optimize further. This is because you now know the sparsity patterns you want: you only need `C[ind_nz, ind_nz]`, where `ind_nz` is the collection of nonzero column indices of `RiT`.

This is because sparse matrices are much slower than dense matrices if they have almost full `nnz`; and your problem is that `C*RiT` is kinda sparse (zero columns) but dense in all nonzero columns. By selecting the nonzero columns by hand, you get the advantages of both sparsity and fast `sparse*dense` operations, with the bonus that you never compute unneeded rows of `C*RiT`.
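The identity behind that suggestion can be checked on a toy example: if `Ri` only has nonzeros in the columns `nzc`, then `Ri*C*Ri'` never reads anything outside `C[nzc, nzc]` (sizes and names below are illustrative):

```julia
using LinearAlgebra, SparseArrays

# If Ri only has nonzeros in columns nzc, then Ri*C*Ri' only ever reads
# C[nzc, nzc]. Sizes and names are illustrative.
Ri = sprand(20, 50, 0.05)
C  = rand(50, 50)

# nonzero columns of Ri, read straight off the CSC column pointers
nzc = findall(j -> Ri.colptr[j] < Ri.colptr[j+1], 1:size(Ri, 2))
Ri_nzc = Array(Ri[:, nzc])    # dense, since only a few columns survive

@assert Ri * C * Ri' ≈ Ri_nzc * C[nzc, nzc] * Ri_nzc'
```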

It's a neat idea; shame it doesn't work for my case. The matrix `R` doesn't have any zero rows, therefore `R'` doesn't have any zero columns. However, about 62% of the columns of `R` are non-zero (`nzc`). So I tried doing `Ri_nzc*(Ri_nzc*C[nzc,nzc])'` (`Ri_nzc` is precalculated, but I couldn't do the same for `C[nzc,nzc]` due to memory constraints), but it was way slower; it never actually completed. I think the slicing of the `C` matrix was causing the massive slowdown. I also tried using views of `C`, but that didn't work either.