Speeding up sparse matrix multiplication and assembly

I am trying to solve a large linear system using IterativeSolvers and AlgebraicMultigrid.

Let my elemental stiffness matrix be Ke (size = 25x25xNe) where Ne is number of elements. The index for assembly is iglob (size = 5x5xNe).

My naive assembly looks like this:

function assemble_csc(Ke, iglob, Ne, nglob)
    K_csc::SparseMatrixCSC{Float64} = spzeros(nglob,nglob)
    for eo in 1:Ne 
        ig = iglob[:,:,eo]
        K_csc[vec(ig), vec(ig)] += Ke[:,:,eo]
    return K_csc

This simple function is super slow for very large problems so I am using FEMSparse which assembles matrix in COO format and is much much faster.

The function for that would be:

function assemble_coo(Ke, iglob, Ne)
    K_coo = SparseMatrixCOO()
    for eo in 1:Ne 
        ig = iglob[:,:,eo]
        FEMSparse.assemble_local_matrix!(K_coo, vec(ig), vec(ig), Ke[:,:,eo])  
    return SparseMarixCSC(K_coo)

The matrices K_csc and K_coo are the same except for the order in which they are stored and this is giving me performance issues. Note that I have already changed the type of K_coo to be in the CSC format.

Consider simple muliplication for both the matrices:

@btime mul!(a, K_csc, F);
  392.372 μs (0 allocations: 0 bytes)

@btime mul!(a, K_coo, F);
  1.635 ms (1 allocation: 48 bytes)

Also, using iterativesolvers:

@btime cg!(d, K_coo, rhs, Pl=p, tol=1e-6);
  1.897 ms (21 allocations: 904.94 KiB)

@btime cg!(d, K_csc, rhs, Pl=p2, tol=1e-6);
  596.527 μs (20 allocations: 904.89 KiB)

Using FEMSparse is like more than ~20 times faster than my naive assembly, but that hurts my multiplication performance (especially for very large problems > 10 million degrees of freedom).

Any suggestions to get the best of both worlds? Is there a way to rearrange the SparseMatrix to be optimal for linear algebra?

EDIT: It seems that the FEMSparse method is storing some zeros as stored values, thus making the size of the sparsematrix larger than my approach, therefore yielding slower multiplication results. I am still figuring out how to get rid of these zeros.

EDIT 2: The fastest approach is either using FEMSparse or @foobar_lv2’s function and SparseArrays.dropzeros!(). This gives me fast assembly as well as fast multiplication/cg. For fastest multiplication results, I am storing the transpose of CSC format.

Thanks all!


Sure, look at the different constructors. The main issue is that your construction is potentially quadratic (updating sparsity structure is O(nnz); hence you must use batch constructors). You probably could use the following constructor:

julia> sparse([1,2, 1], [6,7, 6], [1.4, 2.1, 1.2], 50, 50, +)
50×50 SparseMatrixCSC{Float64,Int64} with 2 stored entries:
  [1 ,  6]  =  2.6
  [2 ,  7]  =  2.1

That would make:

function assemble_csc(Ke, iglob, Ne, nglob)
I = Vector{Int}(undef, length(Ke))
J = Vector{Int}(undef, length(Ke))
V = Vector{Float64}(undef, length(Ke))
ct = 1
for eo = 1:Ne 
v = view(iglob, :, :, eo)
for j = 1:length(v)
@inbounds for i=1:length(v)
I[ct] = v[i]
J[ct] = v[j]
V[ct] = Ke[i, j, eo]
ct += 1
return sparse(I,J,V, Ne, Ne, +)
1 Like

Why not directly assemble the sparse matrix as CSC? For example as in https://github.com/PetrKryslUCSD/FinEtools.jl/blob/master/src/AssemblyModule.jl


Thanks, I tried this. This method is exactly as fast as using FEMSparse. Thus, the assembly is ~100 times faster than my approach, but the matrix multiplication is ~ 4 times slower.

This method (and FEMSparse) is somehow storing some zeros as values. The multiplication yields the same answer, but this matrix has more stored values.

My matrix: 38801×38801 SparseMatrixCSC{Float64,Int64} with 424801 stored entries

Your matrix: 38801×38801 SparseMatrixCSC{Float64,Int64} with 1384801 stored entries

Is there any way to remove zeros from sparsematrix stored entries?

Yes, use SparseArrays.dropzeros!.

For best performance you likely want to use some reordering of the dofs, e.g. Cuthill Mckee.

If you are solving something with the same sparsity pattern multiple times, it is generally better to only create the sparsity pattern once, and then assemble directly into the CSC structure in each “time step” (e.g. as in http://kristofferc.github.io/JuAFEM.jl/dev/examples/generated/heat_equation/#Degrees-of-freedom-1).


Yes, this works perfect for my purposes. I actually need to assemble it only once before the start of the time loop, so I can use this function to build K, and I only need to multiply K*d inside the time loop, where d changes every step but K is constant.

1 Like

@kristoffer.carlsson I have a follow up question: I tried Cuthill McKee’s rcmpermute but it is giving me incorrect results. Won’t the forcing vector also have to be rearranged when doing the multiplication?

Also, my matrix is already king of banded so I don’t know if it should give me significant speedups. I have to do matrix multiplication and linear system solving at every timestep using a fixed K matrix, so even marginal speedups would add up in the entire simulation.


I am slightly confused. Why do you want the compressed-column format, when what you need is fast matrix-vector multiplication? The correct format for this is compressed-row.

1 Like

Yeah, sorry my bad.

I thought that the format was slowing down my multiplication whereas it was actually additional zeros that was the culprit. dropzeros!() fixes that.

I am still assembling as CSC format but storing it as a transpose for multiplication.