Speed up sparse matrix multiplication

I have a sparse matrix multiplication in my code, but it’s very slow. Is there any way to make it faster?

@btime A2 = D * A * D

 50.816 s (18 allocations: 5.12 GiB)

A: adjacency matrix, 6554063×6554063 SparseMatrixCSC{Float64,Int64} with 152837785 stored entries

D: diagonal matrix, 6554063×6554063 SparseMatrixCSC{Float64,Int64} with 6554063 stored entries

If you know D is a diagonal matrix, make use of this information:

using BenchmarkTools, LinearAlgebra, SparseArrays

function diag_sparse_diag(D1::Diagonal, A::AbstractSparseMatrix, D2::Diagonal)
	diag1 = D1.diag
	diag2 = D2.diag
	@assert (length(diag1), length(diag2)) == size(A)
	res = similar(A)  # same sparsity pattern as A, values uninitialized
	rows = rowvals(res)
	vals_A = nonzeros(A)
	vals_res = nonzeros(res)
	@inbounds for col in axes(res, 2)
		@simd for k in nzrange(res, col)
			row = rows[k]
			# scale each stored entry by its row and column diagonal factors
			vals_res[k] = diag1[row] * vals_A[k] * diag2[col]
		end
	end
	return res
end

n = 6554063
nvals = 152837785
A = sprandn(n, n, nvals / n ^ 2)
D = Diagonal(randn(n))
D_sparse = sparse(D)

println(D_sparse * A * D_sparse == D * A * D == diag_sparse_diag(D, A, D))

@btime $D_sparse * $A * $D_sparse
@btime $D * $A * $D
@btime diag_sparse_diag($D, $A, $D)


  45.017 s (16 allocations: 5.12 GiB)
  4.050 s (12 allocations: 4.65 GiB)
  2.415 s (6 allocations: 2.33 GiB)

If you are sure that D_sparse is a diagonal matrix, creating a Diagonal from it is essentially free: D = Diagonal(nonzeros(D_sparse)). Note that this requires the sparsity pattern of D_sparse to be exactly diagonal (every diagonal entry stored, nothing off-diagonal).
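A minimal sketch of that wrapping trick (values are illustrative): when the pattern is exactly diagonal, the stored values of D_sparse *are* the diagonal, so Diagonal just wraps that vector without copying:

```julia
using LinearAlgebra, SparseArrays

# Minimal sketch: if the sparsity pattern of D_sparse is exactly diagonal,
# its stored values are the diagonal itself, so wrapping them in Diagonal
# copies no data.
D_sparse = sparse(Diagonal([1.0, 2.0, 3.0]))
D = Diagonal(nonzeros(D_sparse))

A = sparse([0.0 1.0 0.0; 2.0 0.0 0.0; 0.0 0.0 3.0])
@assert D * A * D == D_sparse * A * D_sparse
```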


So I see that sparse-sparse matrix multiplication is the slowest of the three. Can you explain why? I was expecting exactly the opposite.

Diagonal is a much more specific structure than SparseMatrixCSC. In particular, CSC matrices have relatively slow access times. More specifically structured matrix types like Diagonal, block, or banded matrices will always be faster where applicable (often 10x faster). CSC is what you use when you aren’t lucky enough to have the additional structure.
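To make that concrete, here is a small sketch (sizes shrunk so it runs quickly; the gap grows with size). Both paths produce identical results, but the Diagonal version dispatches to a simple row/column scaling loop instead of a general sparse-sparse multiply:

```julia
using LinearAlgebra, SparseArrays

# Minimal sketch: same product, two dispatch paths. The Diagonal method only
# touches the stored entries of A; the CSC-times-CSC method does a general
# sparse matrix product.
n = 1_000
A = sprandn(n, n, 0.01)
D = Diagonal(randn(n))
D_sparse = sparse(D)

@assert D * A * D == D_sparse * A * D_sparse
@assert (D * A * D) isa SparseMatrixCSC  # the output stays sparse either way
```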


The output is a sparse matrix with dimension of (6554063, 6554063). I want to sum the output with its transpose but it is very slow. Is there any way to improve this?

@btime out2 = out + sparse(out');

29.954 s (16 allocations: 6.93 GiB)

Don’t use sparse(out') just out'. I benchmark it to be noticeably (although not a ton) faster. The main reason this is slow is that it’s allocating a ton of memory to store the result since out+out' often has 2x more nonzeros than out.


Digging through the source code, it looks like A + A' internally makes a SparseMatrixCSC copy of A'.

It would be nice to have an optimized method for this, since computing the Hermitian part of a matrix is a pretty common operation.


It’s 3x faster. Can you explain why that is? out' is not a SparseMatrixCSC anymore but an adjoint, and I’m not sure whether it’s still sparse or dense. I’m used to MATLAB, where sparse is always the fastest.

Julia has lazy adjoints: ' basically doesn’t copy any data, it just records the fact that you took the adjoint as part of the type information. As such, for sparse matrices the adjoint preserves sparsity.
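A small sketch showing the laziness directly: ' returns a zero-copy wrapper around the same matrix, so the sparse data is shared and mutations to A are visible through A'.

```julia
using LinearAlgebra, SparseArrays

# Minimal sketch: A' is an Adjoint wrapper sharing A's storage, not a new
# SparseMatrixCSC.
A = sprandn(10, 10, 0.3)
At = A'
@assert At isa Adjoint        # a lazy wrapper type, not a copied matrix
@assert parent(At) === A      # no data was copied
A[1, 2] = 42.0
@assert At[2, 1] == 42.0      # the wrapper sees the mutation
```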