Speeding up sparse matrix multiplication and assembly

thehalfspace · June 19, 2019, 6:09pm

I am trying to solve a large linear system using IterativeSolvers and AlgebraicMultigrid.

Let my elemental stiffness matrix be Ke (size = 25x25xNe) where Ne is number of elements. The index for assembly is iglob (size = 5x5xNe).

My naive assembly looks like this:

function assemble_csc(Ke, iglob, Ne, nglob)
    K_csc::SparseMatrixCSC{Float64} = spzeros(nglob,nglob)
    for eo in 1:Ne 
        ig = iglob[:,:,eo]
        K_csc[vec(ig), vec(ig)] += Ke[:,:,eo]
    end
    return K_csc
end

This simple function is super slow for very large problems so I am using FEMSparse which assembles matrix in COO format and is much much faster.

The function for that would be:

function assemble_coo(Ke, iglob, Ne)
    K_coo = SparseMatrixCOO()
    for eo in 1:Ne 
        ig = iglob[:,:,eo]
        FEMSparse.assemble_local_matrix!(K_coo, vec(ig), vec(ig), Ke[:,:,eo])  
    end
    return SparseMatrixCSC(K_coo)
end

The matrices K_csc and K_coo are the same except for the order in which they are stored and this is giving me performance issues. Note that I have already changed the type of K_coo to be in the CSC format.

Consider a simple multiplication with both matrices:

@btime mul!(a, K_csc, F);
  392.372 μs (0 allocations: 0 bytes)

@btime mul!(a, K_coo, F);
  1.635 ms (1 allocation: 48 bytes)

Also, using IterativeSolvers:

@btime cg!(d, K_coo, rhs, Pl=p, tol=1e-6);
  1.897 ms (21 allocations: 904.94 KiB)

@btime cg!(d, K_csc, rhs, Pl=p2, tol=1e-6);
  596.527 μs (20 allocations: 904.89 KiB)

Assembly with FEMSparse is more than ~20 times faster than my naive assembly, but it hurts my multiplication performance (especially for very large problems with more than 10 million degrees of freedom).

Any suggestions to get the best of both worlds? Is there a way to rearrange the SparseMatrix to be optimal for linear algebra?

EDIT: It seems that the FEMSparse method is storing some zeros as stored values, which gives the sparse matrix more stored entries than my approach and therefore slower multiplication. I am still figuring out how to get rid of these zeros.

EDIT 2: The fastest approach is either using FEMSparse or @foobar_lv2’s function and SparseArrays.dropzeros!(). This gives me fast assembly as well as fast multiplication/cg. For fastest multiplication results, I am storing the transpose of CSC format.

Thanks all!

foobar_lv2 · June 19, 2019, 6:49pm

Sure, look at the different constructors. The main issue is that your construction is potentially quadratic (updating sparsity structure is O(nnz); hence you must use batch constructors). You probably could use the following constructor:

julia> sparse([1,2, 1], [6,7, 6], [1.4, 2.1, 1.2], 50, 50, +)
50×50 SparseMatrixCSC{Float64,Int64} with 2 stored entries:
  [1 ,  6]  =  2.6
  [2 ,  7]  =  2.1

That would make:

function assemble_csc(Ke, iglob, Ne, nglob)
    # One (row, column, value) triplet per local stiffness entry
    I = Vector{Int}(undef, length(Ke))
    J = Vector{Int}(undef, length(Ke))
    V = Vector{Float64}(undef, length(Ke))
    ct = 1
    for eo = 1:Ne
        v = view(iglob, :, :, eo)
        for j = 1:length(v)
            @inbounds for i = 1:length(v)
                I[ct] = v[i]
                J[ct] = v[j]
                V[ct] = Ke[i, j, eo]
                ct += 1
            end
        end
    end
    # Duplicate (i, j) pairs are summed with +; the result is nglob×nglob
    return sparse(I, J, V, nglob, nglob, +)
end

PetrKryslUCSD · June 19, 2019, 6:51pm

Why not directly assemble the sparse matrix as CSC? For example as in https://github.com/PetrKryslUCSD/FinEtools.jl/blob/master/src/AssemblyModule.jl

thehalfspace · June 19, 2019, 9:38pm

Thanks, I tried this. This method is exactly as fast as using FEMSparse. Thus, the assembly is ~100 times faster than my approach, but the matrix multiplication is ~ 4 times slower.

This method (and FEMSparse) is somehow storing some zeros as values. The multiplication yields the same answer, but this matrix has more stored values.

My matrix: 38801×38801 SparseMatrixCSC{Float64,Int64} with 424801 stored entries

Your matrix: 38801×38801 SparseMatrixCSC{Float64,Int64} with 1384801 stored entries

Is there any way to remove the stored zeros from a sparse matrix?

kristoffer.carlsson · June 19, 2019, 9:48pm

Yes, use SparseArrays.dropzeros!.
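A minimal sketch of what that looks like, on a made-up 3×3 example (the index and value vectors here are purely illustrative, not from the problem above). The batch constructor keeps an entry even when its value is zero or when duplicates cancel, and dropzeros! removes those in place:

```julia
using SparseArrays

# Illustrative triplets: the two (2, 2) contributions cancel to 0.0,
# and the (3, 1) entry is an explicit zero.
rows = [1, 2, 2, 3]
cols = [1, 2, 2, 1]
vals = [1.0, 0.5, -0.5, 0.0]

K = sparse(rows, cols, vals, 3, 3)   # duplicates are summed, zeros are kept

nnz_before = nnz(K)   # number of stored entries, including stored zeros
dropzeros!(K)         # remove stored zeros in place
nnz_after = nnz(K)
```

Comparing nnz before and after is a quick way to check how many of the stored entries were just zeros.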

For best performance you likely want to use some reordering of the dofs, e.g. Cuthill-McKee.

If you are solving something with the same sparsity pattern multiple times, it is generally better to only create the sparsity pattern once, and then assemble directly into the CSC structure in each “time step” (e.g. as in http://kristofferc.github.io/JuAFEM.jl/dev/examples/generated/heat_equation/#Degrees-of-freedom-1).
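A rough sketch of that idea on a toy 1D chain of two-node elements. The names conn, Ke and assemble! are hypothetical and only for illustration; this is not how JuAFEM does it internally. The pattern is built once with the batch constructor, and later assemblies only overwrite values of entries that are already stored, so the sparsity structure is never modified:

```julia
using SparseArrays

# Hypothetical toy mesh: 3 two-node elements on 4 dofs.
conn = [1 2 3;
        2 3 4]                 # conn[:, e] holds the global dofs of element e
Ke = [1.0 -1.0; -1.0 1.0]      # same local matrix for every element

# Step 1 (once): build the sparsity pattern with explicit zeros as values.
rows = Int[]
cols = Int[]
for e in axes(conn, 2), j in 1:2, i in 1:2
    push!(rows, conn[i, e])
    push!(cols, conn[j, e])
end
K = sparse(rows, cols, zeros(length(rows)), 4, 4)

# Step 2 (every "time step"): reset the values and add into stored entries.
function assemble!(K, conn, Ke)
    fill!(nonzeros(K), 0.0)    # keeps the pattern, zeroes the values
    for e in axes(conn, 2), j in 1:2, i in 1:2
        # Every (row, col) pair touched here is already in the pattern,
        # so this updates a stored value and inserts nothing new.
        K[conn[i, e], conn[j, e]] += Ke[i, j]
    end
    return K
end

assemble!(K, conn, Ke)
```

Indexing with K[i, j] still does a search within the column on each access, so a production assembler would typically write into nonzeros(K) directly. The sketch only shows the pattern-reuse idea.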

thehalfspace · June 19, 2019, 9:56pm

Yes, this works perfect for my purposes. I actually need to assemble it only once before the start of the time loop, so I can use this function to build K, and I only need to multiply K*d inside the time loop, where d changes every step but K is constant.

thehalfspace · June 20, 2019, 2:49pm

@kristoffer.carlsson I have a follow up question: I tried Cuthill McKee’s rcmpermute but it is giving me incorrect results. Won’t the forcing vector also have to be rearranged when doing the multiplication?

Also, my matrix is already kind of banded, so I don’t know if it should give me significant speedups. I have to do matrix multiplication and linear system solving at every timestep using a fixed K matrix, so even marginal speedups would add up in the entire simulation.

Thanks
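On the question of the forcing vector: yes, with a symmetric permutation the vector has to be permuted the same way, and the result has to be scattered back. A small sketch, where a random permutation p stands in for whatever ordering a Cuthill-McKee routine would return (the matrix and vector are made up for the check):

```julia
using SparseArrays, LinearAlgebra, Random

Random.seed!(1)
n = 6
A = sprandn(n, n, 0.4)
K = A + A' + spdiagm(0 => fill(10.0, n))   # small symmetric test matrix
F = randn(n)

p = randperm(n)     # assumption: stand-in for a Cuthill-McKee ordering

Kp = K[p, p]        # rows and columns permuted together
Fp = F[p]           # the vector is permuted with the same ordering

yp = Kp * Fp        # product in the permuted numbering, equals (K * F)[p]

y = similar(yp)
y[p] = yp           # scatter back to the original numbering
```

Here y matches K * F up to rounding. Multiplying Kp by the unpermuted F, or forgetting to scatter the result back, would give the kind of incorrect results described above.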

PetrKryslUCSD · June 24, 2019, 5:35am

I am slightly confused. Why do you want the compressed-column format, when what you need is fast matrix-vector multiplication? The correct format for this is compressed-row.

thehalfspace · June 24, 2019, 6:37pm

Yeah, sorry my bad.

I thought that the format was slowing down my multiplication, whereas the additional stored zeros were actually the culprit. dropzeros!() fixes that.

I am still assembling as CSC format but storing it as a transpose for multiplication.
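Roughly, the transpose trick looks like the sketch below. The random matrix is only a stand-in for the assembled K, and the sizes are arbitrary. Kt stores the rows of K as its columns, and the lazy transpose(Kt) wrapper represents K again, so each output entry is computed from one stored column of Kt, which behaves like a compressed-row product:

```julia
using SparseArrays, LinearAlgebra

K = sprandn(200, 200, 0.05)   # stand-in for the assembled stiffness matrix
x = randn(200)

Kt = copy(transpose(K))       # materialized transpose, still a SparseMatrixCSC
A = transpose(Kt)             # lazy wrapper, mathematically equal to K

y = zeros(200)
mul!(y, A, x)                 # in-place product, same result as K * x
```

Whether this is actually faster than mul!(y, K, x) depends on the matrix and the machine, so it is worth checking with @btime on the real problem.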
