Reusing sparsity patterns in sparse matrix products

I have this sparse matrix operation R*kron(I,C)*R', where R is a sparse matrix, I is the identity matrix, and C is a dense square matrix. These matrices can be quite large: R is 2000 x 6250000 and the Kronecker product is 6250000 x 6250000, but fortunately only the 2500 x 2500 matrix C needs to be stored. The product is only a 2000 x 2000 square matrix.

Because of the size of these matrices it is very slow to calculate the product, not to mention that a naive implementation would use a ton of memory. I managed to take advantage of the structure of the Kronecker product to turn the product into the sum of the products over each diagonal block, i.e. sum(R*kron(Diagonal(circshift([1,0,...],i)),C)*R' for i=1:nblocks). This is faster, parallelizable, and memory efficient, but it is still prohibitively expensive to calculate.
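For concreteness, here is a tiny sanity check of that decomposition (the sizes are illustrative stand-ins, not the real 2000 x 6250000 problem, and kron(I,C) is materialized only for the comparison):

using LinearAlgebra, SparseArrays

nblocks, nc = 4, 5                 # toy sizes
C = rand(nc, nc)
R = sprand(3, nblocks*nc, 0.3)

direct = R * kron(Matrix(I, nblocks, nblocks), C) * R'
e(i) = circshift([1; zeros(Int, nblocks-1)], i-1)   # i-th unit vector
blockwise = sum(R * kron(Diagonal(e(i)), C) * R' for i in 1:nblocks)
@assert blockwise ≈ direct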

For my use case I have to calculate this product many times. However, only the values of the C matrix change, while the rest of the matrices are constant. Since R is sparse, each entry of the final product should only involve relatively few elements of C. In theory it should be faster to calculate the product symbolically once and reuse it. My first impulse was to use SymEngine to make a C matrix filled with symbols, but it's taking too long to calculate.
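To illustrate the symbolic idea at a toy size (building the 2500^2 symbols is exactly the part that becomes too slow; the symbol names here are just placeholders):

using SymEngine

n = 3                                          # toy size; the real C is 2500 x 2500
Csym = [symbols("c$(i)_$(j)") for i in 1:n, j in 1:n]
R = Basic.([1 0 2; 0 0 3])                     # stand-in for a tiny sparse R
P = R * Csym * permutedims(R)                  # each entry is a small formula in the c_i_j

Each entry of P could then in principle be reevaluated for new numeric values of C via subs, which is the pattern reuse I am after.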

Does anyone have any alternative approach that will speed up this calculation?

I would suggest splitting R into 2000x2500 submatrices Ri; then the product is the sum of Ri * C * Ri'. This is readily parallelisable, and there may be further optimisations available depending on the structures of Ri and C.
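A quick toy-size check of that identity (illustrative sizes only), sketched against the dense formula:

using LinearAlgebra, SparseArrays

nblocks, nc = 4, 5
C = rand(nc, nc)
R = sprand(3, nblocks*nc, 0.3)

S = sum(R[:, (i-1)*nc+1:i*nc] * C * R[:, (i-1)*nc+1:i*nc]' for i in 1:nblocks)
@assert S ≈ R * kron(Matrix(I, nblocks, nblocks), C) * R'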


Do everything Simon says. In addition, respect the memory layout. Timings are post warm-up:

julia> C=rand(500, 500);
julia> R=sprand(400, 500, 0.1);
julia> RT=copy(R'); # materializes the lazy adjoint
julia> begin
       @time R*C*R';
       @time RT'*C*RT;
       @time R*C*RT;
       @time R*(C*R');
       @time R*(C*RT);
       @time RT'*(C*RT);
       end;
  3.311467 seconds (27 allocations: 6.440 MiB)
  0.031931 seconds (10 allocations: 2.747 MiB)
  0.033586 seconds (8 allocations: 2.747 MiB)
  4.347397 seconds (37 allocations: 7.375 MiB)
  0.052198 seconds (8 allocations: 2.747 MiB, 33.43% gc time)
  0.033085 seconds (10 allocations: 2.747 MiB)

The way you wrote it is almost the worst possible way. There is no pattern to reuse: sparse*dense gives a dense result (unless you are absurdly sparse).
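For instance (toy sizes again), the sparsity is gone after the very first product:

julia> using SparseArrays

julia> issparse(sprand(400, 500, 0.1) * rand(500, 500))
false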

Doing what Simon suggested is actually slower than my original version (below):

using Distributed, SparseArrays

function transform_eprz_cov(C,R)
    # C is a wrapper type: C.data is the dense 2500x2500 block,
    # C.n the number of diagonal blocks in kron(I, C.data)
    nr,nc = size(C.data)
    # CSC pattern of the first diagonal block of kron(I, C.data)
    colptr = vcat(1:nr:(nr*nc),fill(nr*nc+1,nr*C.n + 1 - nc))
    colptr2 = Array{Int}(undef,length(colptr))
    rowval = repeat(1:nr,nc)
    rowval2 = Array{Int}(undef,length(rowval))
    K = SparseMatrixCSC(size(C)..., colptr, rowval, vec(C.data))
    S = @distributed (+) for i=1:C.n
        # rotate the pattern so the stored values sit in the i-th
        # diagonal block; the nonzero values (entries of C) are reused
        circshift!(colptr2,colptr,(i-1)*nc)
        rowval2 .= rowval .+ (i-1)*nr
        K = SparseMatrixCSC(K.m,K.n,colptr2,rowval2, K.nzval)
        Array(R*(R*K)')   # uses the symmetry of C
    end

    return S
end

The best I could do using Simon's suggestion is

using ProgressMeter, SparseArrays

function f(C,R)
    S = zeros(2000,2000)
    @showprogress for i=1:2500:2500*2500
        Ris = R[:,i:(i+2500-1)]    # next 2000x2500 column block of R
        nnz(Ris) == 0 && continue  # skip structurally empty blocks
        Ri = Array(Ris)
        S .= S .+ Ri*(Ri*C.data)'  # uses the symmetry of C
    end
    S
end

which took about 11 minutes compared to my original 4 minutes.

The way I wrote it in the message isn't how I implemented it. C is symmetric, so what I did was R*(R*C)', and R is actually super sparse (density = 0.0003).

You can still gain from materializing the adjoint:

julia> C=rand(500, 500);

julia> R=sprand(400, 500, 0.1);

julia> @time R*(R*C)';
  0.125526 seconds (15 allocations: 2.747 MiB)

julia> @time R*copy((R*C)');
  0.023704 seconds (11 allocations: 4.273 MiB)

Apart from everything @bennedich said, you are measuring the time for splitting R into the Ri.

Simon’s point is that R should never exist; you should split during assembly. Also, you should not materialize the adjoint: you should probably assemble RT directly (i.e. R exists in your mind, the computer only ever sees the list of adjoints RiT, and you compute the sum over RiT' * C * RiT).

Apparently lazy adjoint of sparse times dense is fast, and dense times sparse is fast as well, so that is the layout your data wants to be in; splitting into blocks is obviously right as well.

When considering different C, you of course reuse your RiTs::Vector{SparseMatrixCSC{Float64,Int64}} instead of re-assembling.

I can’t say what the correct @threads strategy is. BLAS can use threads as well, and in order to make optimal use of the L3 cache you want to have as few different matrices running at the same time as possible, while still using as many cores as you have.
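One possible arrangement, as a rough sketch only (the per-thread buffers and the hard-coded 2000x2000 output are assumptions matching your problem, not tested on it):

using LinearAlgebra, SparseArrays, Base.Threads

function f_threaded(C, RiTs)
    BLAS.set_num_threads(1)     # keep BLAS from oversubscribing the cores
    partials = [zeros(2000, 2000) for _ in 1:nthreads()]
    @threads for k in eachindex(RiTs)
        RiT = RiTs[k]
        partials[threadid()] .+= RiT' * C * RiT
    end
    reduce(+, partials)
end

Which split between Julia threads and BLAS threads is fastest you would have to measure.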


Turns out I can get better performance by making C sparse; it now takes about 1:30 min. I tried materializing the transpose and it didn't make a difference when C is sparse.

using SparseArrays

function f(C,R)
    S = zeros(2000,2000)
    Cs = sparse(C.data)            # a sparse C turned out to be faster here
    for i=1:2500:2500*2500
        Ri = R[:,i:(i+2500-1)]
        nnz(Ri) == 0 && continue
        S .= S .+ Ri*(Ri*Cs)'
    end
    S
end

I would still like better performance. A custom compute kernel should be faster, right? Is there a guide for using Taco from Julia anywhere?

Have you tried what we said? I.e.

function assemble(R)
    RiTs = Vector{SparseMatrixCSC{Float64,Int64}}()
    for i=1:2500:2500*2500
        Ri = R[:,i:(i+2500-1)]
        nnz(Ri) == 0 & continue
        push!(RiTs, copy(Ri'))
    end
    RiTs
end

function f(C, RiTs)
    S = zeros(2000,2000)
    #C = sparse(C.data)
    for RiT in RiTs
        S .= S .+ RiT' * C * RiT
    end
    S
end

and both timed separately. If assemble turns out to be the expensive one, then there are ways around it.
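i.e. something along these lines (illustrative, using the names from above):

julia> @time RiTs = assemble(R);

julia> @time S = f(C, RiTs);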

I did try it, but it doesn't make a noticeable difference; the calculation is dominated by the matrix multiplication. Also, as written, the example you gave is very fast due to the fact that the assemble routine returns an empty array (I was very confused for a little while): you needed to use && instead of &. Additionally, I still see a large speed-up if I make C sparse: 100s vs 165s.

Oops.

Okay. Then I don’t know what to do better.

Is C actually sparse (low nnz)? Or is the point rather that RiT has many zero columns? (the “typical” sparse matrix has a low but positive number of nonzero entries in every row and column)

In that case, you might be able to optimize further. This is because you now know the sparsity pattern you want: you only need C[ind_nz, ind_nz], where ind_nz is the collection of nonzero column indices of Ri (equivalently, the nonzero row indices of RiT).

This is because sparse matrices are much slower than dense matrices if they have almost full nnz; and your problem is that C*RiT is kinda sparse (zero columns wherever RiT has zero columns) but dense in all nonzero columns. By selecting the nonzero columns by hand, you get the advantages of both sparsity and fast sparse*dense operations, with the bonus that you never compute unneeded rows of C*RiT.
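A minimal sketch of the selection, assuming each RiT is a SparseMatrixCSC (the helper name reduced_product is made up):

using SparseArrays

function reduced_product(C, RiT)
    # rows of RiT with stored entries = nonzero columns of Ri;
    # only these indices of C can contribute to RiT' * C * RiT
    ind_nz = sort!(unique(rowvals(RiT)))
    RiT_nz = RiT[ind_nz, :]
    RiT_nz' * C[ind_nz, ind_nz] * RiT_nz
end

Whether this wins depends on how much the slice C[ind_nz, ind_nz] costs relative to the flops it saves.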

It’s a neat idea; shame it doesn’t work for my case. The matrix R doesn’t have any zero rows, and therefore R' doesn’t have any zero columns. However, only about 62% of the columns of R are non-zero (nzc). So I tried doing Ri_nzc*(Ri_nzc*C[nzc,nzc])' (Ri_nzc is precalculated, but I couldn’t do the same for C[nzc,nzc] due to memory constraints), but it was way slower; it never actually completed. I think the slicing of the C matrix was causing the massive slowdown. I also tried using views of C, but that didn’t help either.