I am trying to solve a large linear system using IterativeSolvers and AlgebraicMultigrid.

Let my elemental stiffness matrix be Ke (size = 25x25xNe) where Ne is number of elements. The index for assembly is iglob (size = 5x5xNe).

My naive assembly looks like this:

```
function assemble_csc(Ke, iglob, Ne, nglob)
K_csc::SparseMatrixCSC{Float64} = spzeros(nglob,nglob)
for eo in 1:Ne
ig = iglob[:,:,eo]
K_csc[vec(ig), vec(ig)] += Ke[:,:,eo]
end
return K_csc
end
```

This simple function is super slow for very large problems so I am using FEMSparse which assembles matrix in COO format and is much much faster.

The function for that would be:

```
function assemble_coo(Ke, iglob, Ne)
K_coo = SparseMatrixCOO()
for eo in 1:Ne
ig = iglob[:,:,eo]
FEMSparse.assemble_local_matrix!(K_coo, vec(ig), vec(ig), Ke[:,:,eo])
end
return SparseMarixCSC(K_coo)
end
```

The matrices `K_csc`

and `K_coo`

are the same except for the order in which they are stored and this is giving me performance issues. Note that I have already changed the type of K_coo to be in the CSC format.

Consider simple muliplication for both the matrices:

```
@btime mul!(a, K_csc, F);
392.372 ÎĽs (0 allocations: 0 bytes)
@btime mul!(a, K_coo, F);
1.635 ms (1 allocation: 48 bytes)
```

Also, using iterativesolvers:

```
@btime cg!(d, K_coo, rhs, Pl=p, tol=1e-6);
1.897 ms (21 allocations: 904.94 KiB)
@btime cg!(d, K_csc, rhs, Pl=p2, tol=1e-6);
596.527 ÎĽs (20 allocations: 904.89 KiB)
```

Using FEMSparse is like more than ~20 times faster than my naive assembly, but that hurts my multiplication performance (especially for very large problems > 10 million degrees of freedom).

Any suggestions to get the best of both worlds? Is there a way to rearrange the SparseMatrix to be optimal for linear algebra?

EDIT: It seems that the FEMSparse method is storing some zeros as stored values, thus making the size of the sparsematrix larger than my approach, therefore yielding slower multiplication results. I am still figuring out how to get rid of these zeros.

EDIT 2: The fastest approach is either using FEMSparse or @foobar_lv2â€™s function and `SparseArrays.dropzeros!()`

. This gives me fast assembly as well as fast multiplication/cg. For fastest multiplication results, I am storing the transpose of CSC format.

Thanks all!