If you know D is a diagonal matrix, make use of this information:

using BenchmarkTools, LinearAlgebra, SparseArrays
function diag_sparse_diag(D1::Diagonal, A::AbstractSparseMatrix, D2::Diagonal)
    diag1 = D1.diag
    diag2 = D2.diag
    @assert (length(diag1), length(diag2)) == size(A)
    res = similar(A)        # same sparsity structure as A, values uninitialized
    rows = rowvals(res)
    vals_A = nonzeros(A)
    vals_res = nonzeros(res)
    @inbounds for col in axes(res, 2)
        @simd for k in nzrange(res, col)
            row = rows[k]
            vals_res[k] = diag1[row] * vals_A[k] * diag2[col]
        end
    end
    return res
end
n = 6554063
nvals = 152837785
A = sprandn(n, n, nvals / n ^ 2)
D = Diagonal(randn(n))
D_sparse = sparse(D)
println(D_sparse * A * D_sparse == D * A * D == diag_sparse_diag(D, A, D))
@btime $D_sparse * $A * $D_sparse
@btime $D * $A * $D
@btime diag_sparse_diag($D, $A, $D)

Output:

true
45.017 s (16 allocations: 5.12 GiB)
4.050 s (12 allocations: 4.65 GiB)
2.415 s (6 allocations: 2.33 GiB)

If you are sure that D_sparse is a diagonal matrix, you can create a Diagonal matrix from it at no cost with D = Diagonal(nonzeros(D_sparse)). This only works if the sparsity pattern of D_sparse is exactly diagonal (no stored off-diagonal entries).
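A minimal check of this trick on a small, made-up matrix (assuming a purely diagonal sparsity pattern):

```julia
using LinearAlgebra, SparseArrays

# For a sparse matrix whose stored entries are exactly the diagonal, nonzeros
# returns those entries in order, so Diagonal can wrap them without copying.
D_sparse = spdiagm(0 => [1.0, 2.0, 3.0])
D = Diagonal(nonzeros(D_sparse))
@assert D == D_sparse
@assert D.diag === nonzeros(D_sparse)   # same underlying vector, zero copy
```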

Diagonal is a much more specific structure than SparseMatrixCSC, and CSC matrices have relatively slow element access. More specifically structured matrix types, such as diagonal, block, or banded matrices, will always be faster where applicable (often 10x faster). CSC is what you use when you aren't lucky enough to have that additional structure.
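To illustrate why on a tiny example: a Diagonal-vector product is just an elementwise multiply, a single O(n) pass with no index lookups, which a CSC representation cannot match.

```julia
using LinearAlgebra

# A Diagonal product reduces to broadcasting over the stored diagonal vector.
d = [2.0, 3.0, 4.0]
D = Diagonal(d)
x = [1.0, 1.0, 1.0]
@assert D * x == d .* x        # == [2.0, 3.0, 4.0]
```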

The output is a sparse matrix with dimension of (6554063, 6554063). I want to sum the output with its transpose but it is very slow. Is there any way to improve this?

@btime out2 = out + sparse(out');
29.954 s (16 allocations: 6.93 GiB)

Don't use sparse(out'), just out'. I benchmarked it to be noticeably (though not dramatically) faster. The main reason this is slow is that it allocates a lot of memory to store the result, since out + out' often has up to 2x more nonzeros than out.
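A small sketch of the difference (smaller size than the original, so it runs quickly):

```julia
using LinearAlgebra, SparseArrays

# Adding the lazy adjoint directly avoids first materializing a transposed
# copy with sparse(A'); the results are identical.
A = sprandn(1000, 1000, 0.01)
sym = A + A'                 # A' is a lazy wrapper, no transposed copy up front
@assert sym == A + sparse(A')
@assert issymmetric(sym)
```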

It's 3x faster. Can you explain why? out' is not a SparseMatrixCSC anymore; it is an Adjoint, and I'm not sure whether that is still sparse or dense. I'm used to MATLAB, where sparse is always the fastest.

Julia has lazy adjoints: ' doesn't copy data, it just records the fact that you took the adjoint as part of the type information. For sparse matrices, the adjoint therefore preserves sparsity.
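You can see the zero-copy wrapper directly on a small example:

```julia
using LinearAlgebra, SparseArrays

# The adjoint is a lazy wrapper: the transposition lives in the type,
# and the wrapper shares storage with the original matrix.
A = sprandn(10, 10, 0.3)
B = A'
@assert B isa Adjoint{Float64, <:SparseMatrixCSC}
@assert parent(B) === A      # nothing was copied
```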