SymRCM doesn't seem to help the linear solver's performance

Dear all,

Yesterday I came across the SymRCM.jl package, which seems to work really neatly and is easy to use, but it doesn’t seem to bring me any performance benefits. Allow me to explain. I wrote a little computational fluid dynamics program in Julia, and this is how the system matrix looks after discretization:

julia> A
26419×26419 SparseArrays.SparseMatrixCSC{Float64, Int64} with 176659 stored entries:

I made no attempt to renumber the unknowns, and the structure is … well … ugly. But here comes SymRCM, which I use in the following way:

julia> p=symrcm(A);
julia> B=A[p,p]
26419×26419 SparseArrays.SparseMatrixCSC{Float64, Int64} with 176659 stored entries:
julia> b=a[p];

The banded matrix looks really neat; SymRCM did a good job. However, if I run a few numerical tests for the solution of both linear systems, this is what I get:

julia> @time A\a;
  0.214344 seconds (76 allocations: 87.890 MiB, 1.04% gc time)

julia> @time A\a;
  0.212042 seconds (76 allocations: 87.890 MiB, 1.05% gc time)

julia> @time B\b;
  0.222544 seconds (76 allocations: 85.415 MiB, 1.02% gc time)

julia> @time B\b;
  0.226223 seconds (76 allocations: 85.415 MiB, 1.00% gc time)

I tried with a bigger system too (a 499000×499000 matrix) and observed a similar thing:

julia> @time A\a;
 10.209412 seconds (74 allocations: 2.419 GiB, 0.29% gc time)

julia> @time A\a;
 10.375933 seconds (78 allocations: 2.419 GiB, 0.87% gc time)

julia> @time A\a;
 10.203141 seconds (74 allocations: 2.419 GiB, 0.03% gc time)

julia> @time B\b;
 11.448977 seconds (78 allocations: 2.585 GiB, 0.75% gc time)

julia> @time B\b;
 11.374048 seconds (74 allocations: 2.585 GiB, 0.02% gc time)

julia> @time B\b;
 11.415554 seconds (76 allocations: 2.585 GiB, 0.24% gc time)

It seems that the banded system didn’t offer any advantage in terms of CPU time; it even seems fractionally slower. Do you have an idea what is going on here? Is it perhaps that the linear solver invoked by A\a already calls an algorithm to reduce the matrix bandwidth?

P.S. The matrices are stored in SparseMatrixCSC format, and I use LinearAlgebra and SparseArrays in addition to the SymRCM mentioned above. Quite frankly, I don’t know which solver the \ operator invokes. I would assume one from the Krylov-subspace family. In that case, bandwidth might not play such a big role, but I would still expect fewer cache misses during the solution and therefore better performance.
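P.P.S. One way to check which factorization \ picks is to call it directly. For a general SparseMatrixCSC, \ goes through lu, which dispatches to UMFPACK from SuiteSparse. A minimal sketch (the tridiagonal matrix here is just a stand-in for my CFD matrix):

```julia
using LinearAlgebra, SparseArrays

# Stand-in system; the CFD matrix would take its place.
n = 1_000
A = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(4.0, n), 1 => fill(-1.0, n - 1))
a = rand(n)

F = lu(A)    # the factorization `\` uses for a general sparse matrix
typeof(F)    # an UMFPACK LU factorization object
x = F \ a
```

If that is indeed what runs, it would explain the timings: UMFPACK applies its own fill-reducing column ordering internally, so a prior RCM permutation has little left to improve.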


I don’t know what the sparse matrix package does internally, but you could try the BandedMatrices.jl package.


What if you try one of the direct solvers, cholesky or ldlt?


I think it would be great if the documentation included an example of using ldlt and cholesky.
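Something along these lines could serve (a minimal sketch on a small symmetric positive definite stand-in matrix; both factorizations come from CHOLMOD via SparseArrays):

```julia
using LinearAlgebra, SparseArrays

# Small symmetric positive definite stand-in system.
n = 1_000
A = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(4.0, n), 1 => fill(-1.0, n - 1))
b = rand(n)

F = cholesky(A)   # sparse Cholesky; requires A symmetric positive definite
x = F \ b

G = ldlt(A)       # LDL' factorization; also handles symmetric indefinite matrices
y = G \ b
```

Note that CHOLMOD also applies its own fill-reducing ordering by default; a user-supplied permutation can reportedly be passed via the perm keyword.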


Thanks for the answer, Petr. It is quite conceivable that the difference for direct solvers would be significant, since the bands are much narrower after the call to symrcm and simply less memory is allocated and addressed. But this is not a route I would like to take; I would rather stay with iterative solvers, since they have a much smaller memory footprint and my long-term goal is to scale to many unknowns over a number of GPUs.

OK, but then there is little merit in renumbering. Iterative solvers don’t care.


In that case, you probably want to turn your matrix into a BandedMatrix


Thanks, but that would just increase the storage. I must stick with sparse solvers.

BandedMatrix is a sparse format. It will likely take less space than a generic sparse structure.
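Whether it takes less space depends on how densely the band is filled: a BandedMatrix stores roughly (l + u + 1) * n entries for bandwidths (l, u), while CSC stores one value plus index bookkeeping per structural nonzero. A quick back-of-the-envelope check (a sketch with a hypothetical helper):

```julia
using SparseArrays

# Hypothetical helper: compare dense-band storage against CSC nonzeros.
function band_vs_csc(S::SparseMatrixCSC)
    rows = rowvals(S)
    l = u = 0
    for j in 1:size(S, 2), k in nzrange(S, j)
        i = rows[k]
        l = max(l, i - j)   # widest lower bandwidth seen so far
        u = max(u, j - i)   # widest upper bandwidth seen so far
    end
    band_entries = (l + u + 1) * size(S, 2)   # what a BandedMatrix would store
    (l, u, band_entries, nnz(S))
end
```

If band_entries comes out close to nnz(S) for the RCM-permuted matrix, converting it (e.g. with BandedMatrix(B, (l, u))) should indeed save space; if the band is sparsely filled, it will cost more.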


I was expecting that renumbering the unknowns would lead to fewer cache misses even in iterative solvers. But it seems I was wrong.

Hi guys,

I have some additional details. Renumbering with SymRCM doesn’t help the default solver from SparseArrays.jl (which is UMFPACK, to my understanding), but it does help a lot if I use a combination of a solver from IterativeSolvers.jl with an incomplete-LU preconditioner from IncompleteLU.jl. Depending on the linear system size, renumbering reduces the time spent in the linear solver anywhere from 30% to 60% for the systems I’ve been testing, which range from 20_000 to 2_000_000 unknowns.
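For reference, the combination looks roughly like this (a sketch; the drop tolerance τ and the choice of bicgstabl are assumptions on my part, and the tridiagonal matrix stands in for the RCM-permuted system B):

```julia
using LinearAlgebra, SparseArrays
using IncompleteLU, IterativeSolvers

# Stand-in for the RCM-permuted system B and right-hand side b.
n = 10_000
B = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(4.0, n), 1 => fill(-1.0, n - 1))
b = rand(n)

prec = ilu(B, τ = 0.01)                           # incomplete LU preconditioner
x, history = bicgstabl(B, b; Pl = prec, log = true)  # left-preconditioned BiCGStab(l)
```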


Note that this is consistent with our reasoning above: LU factorization needs to consider all matrix entries that could become non-zero. The reordering reduces the total number of such operations because the entries are tightly clustered around the diagonal.
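The effect can be made visible with CHOLMOD, which accepts a user-supplied ordering: forcing the natural ordering via perm shows the fill-in the unpermuted matrix suffers, while the default lets CHOLMOD pick a fill-reducing ordering. A sketch on a hypothetical arrowhead matrix, which is a worst case for the natural ordering:

```julia
using LinearAlgebra, SparseArrays

# Arrowhead matrix: dense first row/column, otherwise diagonal.
# Eliminating node 1 first connects everything, so the factor fills in.
n = 200
A = spdiagm(0 => fill(Float64(n), n))   # diagonally dominant, hence SPD
for j in 2:n
    A[1, j] = 1.0
    A[j, 1] = 1.0
end

F_nat = cholesky(A; perm = collect(1:n))   # forced natural ordering
F_amd = cholesky(A)                        # CHOLMOD's own fill-reducing ordering
nnz(F_nat), nnz(F_amd)                     # the first is far larger
```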


Yes, I agree, pretty consistent.