The best linear solver for sparse banded-block-banded matrices

Hi Community,

I’m working on a program that requires solving A\b repeatedly, where A is a sparse matrix with the structure sketched below (it comes from a finite-difference discretization of a 2D PDE).
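Roughly, it’s the standard 5-point-stencil pattern; something like this builds a sparse matrix with the same structure (the actual coefficients in my problem differ):

using SparseArrays, LinearAlgebra

n = 40
# tridiagonal blocks on the block diagonal, identity blocks next to them
T = spdiagm(-1 => ones(n-1), 0 => -4ones(n), 1 => ones(n-1))
A = kron(sparse(I, n, n), T) + kron(spdiagm(-1 => ones(n-1), 1 => ones(n-1)), sparse(I, n, n))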

I was wondering if there is any method that can exploit its “banded-block-banded” structure as well as its sparsity to achieve higher speed. I tried solving it as a BandedBlockBandedMatrix, but that was ten times slower than the sparse LU from UMFPACK. I guess that’s because the matrix is too sparse.

I’m not super satisfied with the current method because I’m porting this program from Fortran, which also uses UMFPACK but runs 5 times faster than my Julia version, so I was wondering if there is some black magic that can help. :sweat_smile:

Have you considered just using DifferentialEquations directly? It should be able to do this automatically.

If you store A as a SparseMatrixCSC (have a look at the SparseArrays package), then \ will use UMFPACK internally and hence almost surely be as fast as the Fortran version.

It has a specialized QR and I think LU for these matrix types. And if you’re solving a time-dependent PDE, you can tell DifferentialEquations.jl that your matrix is a BlockBandedMatrix type (or any other matrix type that exists in the ecosystem) to specialize the internal linear solver.
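As a minimal sketch of what passing the matrix type looks like (assuming a linear RHS for simplicity, and a sparse stand-in matrix here; a BandedBlockBandedMatrix goes through the same jac_prototype keyword):

using OrdinaryDiffEq, SparseArrays, LinearAlgebra

n = 40
A = spdiagm(-1 => ones(n^2-1), 0 => -4ones(n^2), 1 => ones(n^2-1))  # stand-in Jacobian
f!(du, u, p, t) = mul!(du, A, u)

# jac_prototype tells the implicit solver the Jacobian's structure/type,
# so the internal linear solve can specialize on it
fun = ODEFunction(f!; jac_prototype = A)
prob = ODEProblem(fun, ones(n^2), (0.0, 1.0))
sol = solve(prob, TRBDF2())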

1 Like

Yeah, that’s the problem. I did it that way, so it should be the same, but somehow my program is slower. Though the slowdown may also be caused by poor multithreading scaling in my code. Unfortunately I don’t know enough about Fortran to compare the two unit by unit.

Try comparing just the time of solving the linear system in Julia against the time the solve takes in your Fortran code; Julia should be at least as fast in that comparison. If it is, the problem is likely in how you assemble the sparse matrix.

Also, there is no good way to exploit the banded-block-banded structure of the matrix when solving a linear system beyond using a generic sparse linear system solver: LU factorisation of a banded-block-banded matrix leads to fill-in, which destroys the special structure.
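You can see the fill-in directly by comparing stored nonzeros before and after the factorisation, e.g. for the 5-point Laplacian sketched in the original post:

using SparseArrays, LinearAlgebra

n = 40
T = spdiagm(-1 => ones(n-1), 0 => -4ones(n), 1 => ones(n-1))
A = kron(sparse(I, n, n), T) + kron(spdiagm(-1 => ones(n-1), 1 => ones(n-1)), sparse(I, n, n))

F = lu(A)              # UMFPACK sparse LU
nnz(A)                 # about 5 nonzeros per row
nnz(F.L) + nnz(F.U)    # substantially more: the factors fill in inside the band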

2 Likes

Thanks for the suggestions. I tested BandedBlockBandedMatrix, and at this sparsity level it seems to be slower than sparse LU. Maybe I didn’t use it the right way:

using BlockBandedMatrices, BandedMatrices, FillArrays, SparseArrays, BenchmarkTools, LinearAlgebra

n = 40
# n×n grid of n×n blocks, block bandwidths (1,1), sub-block bandwidths (1,1)
A = BandedBlockBandedMatrix(Zeros(n^2, n^2), Fill(n, n), Fill(n, n), (1, 1), (1, 1))

# diagonal blocks: tridiagonal with -4 on the diagonal and 1 off it
for K = 1:n
    view(A, Block(K, K))[band(0)] .= -4
    view(A, Block(K, K))[band(1)] .= 1
    view(A, Block(K, K))[band(-1)] .= 1
end

# off-diagonal blocks: identity
for K = 1:n-1
    view(A, Block(K, K+1))[band(0)] .= 1
    view(A, Block(K+1, K))[band(0)] .= 1
end

RHS = ones(n * n) / n^2

@btime $A \ $RHS
Asparse = sparse(A)
@btime $Asparse \ $RHS

Output:

14.092 ms (32918 allocations: 8.76 MiB)
2.877 ms (59 allocations: 2.25 MiB)

I tried to use it once, but due to the special structure of our problem (the PDE comes from an optimal control problem, and the whole thing is a DAE-BVP), it was still faster to go with our hand-written method, unfortunately.

Did you make sure that your number of BLAS threads matches your CPU cores?

If you still have the code around, it might be worth trying again. There have been a lot of recent speed improvements.

We don’t have a DAE-BVP solver, though my guess is that this person didn’t specialize the linear solver or anything like that in a single shooting setup so they probably are just comparing slow methods to slow methods :man_shrugging:

DifferentialEquations.jl can solve 2D PDEs?

Yes. The first demo of that was in 2016. There’s a 2D PDE with automated sparsity handling and specialization of the linear solvers in the second tutorial: https://diffeq.sciml.ai/latest/tutorials/advanced_ode_example/#Automatic-Sparsity-Detection-1 .

1 Like

As far as I can tell, this example requires the user to replace the PDE with a large system of coupled ODEs and then suggests using DifferentialEquations.jl to solve these ODEs. I would say this demonstrates that DifferentialEquations.jl can handle 2D PDEs, but it’s not exactly the same as “solving” such PDEs. Solving would require that I can specify the PDE and a discretisation method, and then DifferentialEquations.jl would write out the scheme for me, just like it does for ODEs.

And it’s still a little early, but there’s DiffEqOperators.jl for helping with auto-discretizations, along with the other 5 libraries on this topic, like NeuralNetDiffEq.jl for other cases, etc.

FEniCS.jl gives you a discretization that you then hand to DifferentialEquations.

Etc.

So that’s finite difference, mesh-free neural net methods, and FEM. Then we point to ApproxFun for pseudospectral.

2 Likes

Also BTW, these pieces are being put together to build an auto-discretizer method where PDEs are declared symbolically via ModelingToolkit and then are solved to the end. Here’s a demo of it:

But there’s really no reason to wait for that interface because you can always grab the components. That said, getting the first fully compatible solvers (via physics-informed neural networks) is one of the chosen summer projects.

@ChrisRackauckas For the BLAS part, no I didn’t, but this part of the code is multithreaded in a for-loop anyway, so I’m not sure it’ll help. Worth trying though, will do tomorrow.

For the DifferentialEquations part, we actually had this discussion months ago on Slack. It’s an HJB equation in an econ model, and we use a somewhat unusual implicit-explicit method to avoid linear solves with the Jacobian. Computing the Jacobian itself, even with sparse differentiation, is slower than our current method.

We would be really happy to outsource our problem entirely to DifferentialEquations; hand-coding time-stepping algorithms is painful, after all. When I wrote the code I kept the interface compatible with the package so that one day we might be able to use it. But at least for now the general-purpose package cannot exploit the special structure of our problem, so we have to do it the crude way.

using LinearAlgebra
BLAS.set_num_threads(4)  # set this to your number of physical cores

is required on many computers due to a bug in the default setting: BLAS threads should default to physical not logical core count? · Issue #33409 · JuliaLang/julia · GitHub . You can get a pretty nice speedup on many computers just by fixing that.

I thought Jesse told you that computation was sped up ~3000x 2 weeks ago: reduce time complexity of matrix2graph by ghislainb · Pull Request #107 · JuliaDiff/SparseDiffTools.jl · GitHub . I was trying to find who you were on the Slack to see if we can test your problem out again. Your problem was also in the same range where matrix coloring was taking around 700 seconds IIRC and that was slowing the computation down, but now that should be sub-second.
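For reference, the colored-Jacobian path looks roughly like this (a sketch with a stand-in function and a stand-in diagonal sparsity pattern, following the SparseDiffTools README):

using SparseDiffTools, SparseArrays, LinearAlgebra

f!(dx, x) = (dx .= 2 .* x)      # stand-in for the real PDE residual
J = sparse(1.0I, 100, 100)      # stand-in sparsity pattern / Jacobian storage

colors = matrix_colors(J)       # graph coloring: structurally orthogonal columns share a color
x = rand(100)
forwarddiff_color_jacobian!(J, f!, x, colorvec = colors)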

1 Like

This looks exciting. Will try. Thanks.
Though our issue is that computing the Jacobian, even with matrix coloring, is slow. Optimal control really messes things up.

If the dense operations are not comparable with Fortran, the other thing to try is GitHub - JuliaLinearAlgebra/MKL.jl: Intel MKL linear algebra backend for Julia, which is very easy to use when the installation is working (which it seems to be right now). To go back to OpenBLAS you can just reinstall the Julia binary (no need to do anything else). Sometimes MKL is way faster than OpenBLAS, and sometimes they are similar (with the tweak that Chris mentioned).

For the linear algebra stuff, I see two approaches:

  • Maybe there are specialized banded-block-banded solvers or algorithms out there. I think LAPACK/MKL has DGBTRS for banded systems, but I don’t know if it is hooked up to the block-banded types yet, and calling out to the underlying Fortran libraries is easy enough. I am also not sure whether there are algorithms that can exploit the banded-block-banded structure specifically. @dlfivefifty any thoughts, especially if MKL is linked in?
  • If the matrices are getting large enough, iterative methods start to become competitive here. These should be easy to try out:
    • For this, you should be able to define your matrix-vector product matrix-free (see the sketch after this list). See 20. Krylov Methods and Matrix Conditioning — Quantitative Economics with Julia for some notes in progress on this sort of stuff.
    • I would look to the PDE literature for preconditioners for this sort of thing. The forward-backward structure is always weird with HJBE, but once it is in matrices I think the sparsity patterns are typical of PDE discretizations. Maybe this is what multigrid preconditioners are for?
    • Since you are solving this problem repeatedly, there might be all sorts of preconditioners that are worth trying. Even if calculating them is expensive, it could speed things up dramatically later. Don’t know, but worth investigating.
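Here’s a minimal matrix-free sketch of that first sub-bullet, assuming the 5-point Laplacian from the benchmark above and using LinearMaps.jl with IterativeSolvers.jl (those two packages are my choice; any Krylov solver that accepts a linear operator would do):

using LinearMaps, IterativeSolvers

# apply the 5-point stencil without ever forming the matrix
function laplacian_apply(x)
    n = isqrt(length(x))
    X = reshape(x, n, n)
    Y = -4 .* X
    Y[2:end, :]   .+= X[1:end-1, :]   # neighbor below
    Y[1:end-1, :] .+= X[2:end, :]     # neighbor above
    Y[:, 2:end]   .+= X[:, 1:end-1]   # neighbor left
    Y[:, 1:end-1] .+= X[:, 2:end]     # neighbor right
    vec(Y)
end

n = 40
L = LinearMap(laplacian_apply, n^2)   # wrap the function as a linear operator
b = ones(n^2) / n^2
x, history = gmres(L, b; log = true)  # a preconditioner would plug in here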

But with all of that said, I am not confident you will beat a sparse direct solver for problems of this size, especially if you compute the LU decomposition once and then reuse it each time you solve. Still, trying out these alternative methods is probably time well spent for when the problems get larger.
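In case it helps, reusing the factorization looks like this (with Asparse and RHS as in the benchmark above; the in-place lu! refactorization needs Julia >= 1.5, if I remember right):

using SparseArrays, LinearAlgebra

F = lu(Asparse)    # expensive: symbolic analysis + numeric factorization (UMFPACK)
x = F \ RHS        # cheap: just triangular solves; reuse F for every new RHS

# if only the numeric values change (same sparsity pattern),
# refactorize in place and keep the symbolic analysis
lu!(F, Asparse)
x = F \ RHS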

Yes, this stuff is best discussed on Slack (as are most advanced topics around differential equations). Sorry I didn’t follow up with an email.

I think there are a few things here:

  • Can the existing diffeq implement the IMEX structure used in your Fortran code (see the sketch after this list for what the IMEX interface looks like)? Chris seemed to think so, but I am of the “believe it when I see it” school when it comes to optimal control problems.
  • Can the existing diffeq implement better algorithms than the one you are using without significant work? This I am very skeptical of, because control problems are just weird.
  • Can we get better algorithms than your current one built into diffeq, or at the very least a more high-performance implementation of the IMEX for larger problems? I think this is highly likely. Chris and I discussed this and he thinks more state-of-the-art approaches from control might be useful. But this stuff can’t happen before the fall.
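For reference, the IMEX entry point in DifferentialEquations.jl is SplitODEProblem; a minimal sketch with stand-in right-hand sides (not the actual HJB terms):

using OrdinaryDiffEq

# stiff part, handled implicitly by the IMEX scheme
f_impl!(du, u, p, t) = (du .= -50.0 .* u)
# non-stiff part, handled explicitly
f_expl!(du, u, p, t) = (du .= sin(t))

prob = SplitODEProblem(f_impl!, f_expl!, ones(10), (0.0, 1.0))
sol = solve(prob, KenCarp4())   # an IMEX additive Runge-Kutta method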