The best linear solver for sparse banded-block-banded matrices

Multigrid preconditioners are great for parabolic PDEs.
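
For instance, a minimal sketch with AlgebraicMultigrid.jl and IterativeSolvers.jl, using a 1D Laplacian as a stand-in for the implicit operator of a parabolic step (the matrix and package choices here are illustrative, not from this thread):

```julia
# Sketch: algebraic multigrid as a preconditioner for a Krylov solve.
using AlgebraicMultigrid, IterativeSolvers, SparseArrays

n = 10_000
# 1D Laplacian standing in for the Jacobian of a parabolic PDE discretization
A = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(2.0, n), 1 => fill(-1.0, n - 1))
b = rand(n)

ml = ruge_stuben(A)        # build the AMG hierarchy
P  = aspreconditioner(ml)  # wrap it as a left preconditioner
x  = cg(A, b; Pl = P)      # preconditioned conjugate gradient
```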

Training a surrogate can be a good idea too.

Since our demonstrations on HJBs and 2D PDEs with block banded matrices run in the sub-second range, it’s the other way around: show us an example where it’s not performing well and we’ll figure out how to handle it. I still haven’t actually seen this example, just the claim that “it hasn’t worked”, yet it is very close to things that we regularly optimize (with GPUs!) in a few minutes, so I have very, very good reason to assume it’s user error. On that sparsity pattern and on 2D parabolic PDEs, the first things to try are: get the Jacobian in 12 function calls via coloring, parallelize within the call, multi-GPU it, swap the linear solver out to PardisoMKL, or try CVODE_BDF(linear_solver=:GMRES) or a Julia-side Newton-Krylov with algebraic multigrid. I haven’t heard anything about trying those things, but if none of them work on this problem, I would really like to see what’s going on, because that would mean something isn’t matching the theory.
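
For concreteness, here is a hedged sketch of the first and last of those suggestions (colored sparse Jacobians, and letting CVODE do Newton–Krylov); `rhs!`, `u0`, and `jac_sparsity` are placeholders for the user’s discretized system, not code from this thread:

```julia
# Sketch: sparse Jacobian via matrix coloring, plus CVODE_BDF with GMRES
# so that no dense Jacobian is ever formed.
using SparseDiffTools, SparseArrays, DifferentialEquations, Sundials

# rhs!(du, u, p, t) and jac_sparsity::SparseMatrixCSC are assumed to come from
# the user's upwinded finite-difference discretization.
colors = matrix_colors(jac_sparsity)       # few colors => few function calls
J = Float64.(jac_sparsity)
forwarddiff_color_jacobian!(J, (du, u) -> rhs!(du, u, nothing, 0.0), u0;
                            colorvec = colors, sparsity = jac_sparsity)

# Matrix-free alternative: Newton-Krylov inside CVODE's BDF integrator.
prob = ODEProblem(rhs!, u0, (0.0, 1.0))
sol = solve(prob, CVODE_BDF(linear_solver = :GMRES))
```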

unless it is not parabolic…

Where are the HJBE examples that have an “H” (i.e., they really are control problems)?

I don’t think people can do this without your help. It isn’t like they haven’t tried.

These are forward-backward problems, not standard IVPs or BVPs. There is a reason that control theory has its own dedicated set of algorithms and theory. Dedicate someone to go through the details with @RangeFu, and you can see if you can crack the problem.

The forward-backward problems aren’t standard BVPs, but they can be formulated as multipoint BVPs.
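
To make “multipoint BVP” concrete, here is a toy sketch adapted from the standard DifferentialEquations.jl pendulum example (not from this thread), where one residual is imposed at an interior time point; GeneralMIRK4 is the general-boundary-condition collocation solver in the BoundaryValueDiffEq.jl releases of this era:

```julia
# Toy multipoint BVP: one condition at the midpoint of the time span, one at the end.
using BoundaryValueDiffEq

const grav, len = 9.81, 1.0
tspan = (0.0, pi / 2)

function pendulum!(du, u, p, t)
    du[1] = u[2]
    du[2] = -(grav / len) * sin(u[1])
end

# Residuals may reference the solution at any mesh point, not just the endpoints.
function bc!(residual, u, p, t)
    residual[1] = u[end ÷ 2][1] + pi / 2   # interior (multipoint) condition
    residual[2] = u[end][1] - pi / 2       # terminal condition
end

bvp = BVProblem(pendulum!, bc!, [pi / 2, pi / 2], tspan)
sol = solve(bvp, GeneralMIRK4(), dt = 0.05)
```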

For sure, there are all sorts of possible mappings. Maybe the mapped versions are faster than the hand-coded setup as a sparse linear system, maybe they are slower. And I really hope we can explore them and use better algorithms.

I am just saying that it isn’t the sort of thing where the onus is on the user to prove the existing methods don’t work well, because these problems are sufficiently different from existing examples and algorithms. There are no HJBE examples I am aware of.

> On that sparsity pattern and on 2D parabolic PDEs, the first things to try are: get the Jacobian in 12 function calls via coloring, parallelize within the call, multi-GPU it, swap the linear solver out to PardisoMKL, or try CVODE_BDF(linear_solver=:GMRES) or a Julia-side Newton-Krylov with algebraic multigrid.

OK let me explain why “it hasn’t worked”.

Our grid is 40 × 40 × 50 (it’s a 3D problem, but for the last dimension we use another IMEX structure to speed things up further), and there is a root-finding procedure nested at each point (the optimal control), so one function call already takes us 150 ms without multithreading. Say with multithreading we manage to get that down to 15 ms. With 12 calls to evaluate the Jacobian that’s already 180 ms, before counting any of the further steps.

Now, our current method requires only 1 function call and 50 parallelizable A\b solves. LU at this scale takes 3 ms, and multithreading here doesn’t scale well (I’m still investigating why), but it can still be done in 30-40 ms. Therefore one time step only takes 50 ms or so. I just don’t see how any Jacobian-based method can beat this speed.
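
For what it’s worth, a minimal sketch of that “50 parallelizable A\b” step, threading over the slices and pinning BLAS to one thread to avoid oversubscription (the slice matrices are stand-ins for the actual model):

```julia
# Sketch: independent sparse LU solves per slice, threaded at the outer level.
using LinearAlgebra, SparseArrays, Base.Threads

BLAS.set_num_threads(1)  # keep BLAS threads from fighting with the @threads loop

function solve_slices!(xs, As, bs)
    @threads for k in eachindex(As)
        xs[k] = lu(As[k]) \ bs[k]   # each 40×40-grid slice (1600 unknowns) solved independently
    end
    return xs
end
```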

That being said, the explicit part of this problem is indeed plagued by oscillations as the model gets more complicated, and if things get worse I guess the fully implicit methods from DifferentialEquations.jl are the last resort, but that almost surely comes with a sacrifice in speed.

Hi @jlperla, thanks for all these useful resources.

Yeah, for our current problem we can split the large matrices into smaller parallelizable ones, so at this scale I guess a sparse direct solver should be the most efficient. At least, I tried Pardiso and it’s slower than sparse LU for our problem. For the cases where we cannot split the large matrix, I guess we do indeed have to look into the things you referred to here.
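
Roughly, the comparison looked like this (`A` and `b` stand in for one of the per-slice sparse systems; this is a sketch, not the actual benchmark code):

```julia
# Rough timing sketch: MKL Pardiso vs. the default UMFPACK sparse LU.
using Pardiso, SparseArrays, LinearAlgebra, BenchmarkTools

ps = MKLPardisoSolver()
@btime solve($ps, $A, $b)   # Pardiso (MKL) sparse direct solve
@btime lu($A) \ $b          # UMFPACK sparse LU, the default sparse `\` path
```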

That being said, we are OK with our current speed. I asked here just in case there are specialized banded-block-banded solvers out there that I somehow missed. The slowest part of our problem is calculating Jacobians for the whole system, which is easily parallelizable, so if we are really dying for speed, we just distribute it over multiple nodes on the cluster and get it done quickly. The pain is always in hand-coding the upwinding and the time stepping, but thanks to Julia it’s already much easier than in Fortran. :wink:

Indeed, but that calculation is far too simple. The point of the fully implicit method is to take larger steps, and the Jacobian only needs to be calculated every few steps. If it takes 10x fewer steps but each step costs 3x more, that’s how it gains its performance advantage. And those oscillations are precisely what something like TRBDF2 is built to take large time steps through, thanks to L-stability. https://benchmarks.sciml.ai/html/MOLPDE/Filament.html is the 2D PDE that we usually use to test these characteristics.
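
A minimal sketch of that kind of setup, assuming `rhs!`, `u0`, and a sparse `jac_sparsity` pattern from the user’s discretization (placeholders, not anything from this thread):

```julia
# Sketch: L-stable TRBDF2 with a sparse Jacobian prototype, so larger time
# steps can be taken without the explicit-scheme oscillations.
using OrdinaryDiffEq, SparseArrays

f    = ODEFunction(rhs!; jac_prototype = jac_sparsity)
prob = ODEProblem(f, u0, (0.0, 1.0))
sol  = solve(prob, TRBDF2(); reltol = 1e-6, abstol = 1e-8)
```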

I understand. Our current practice is to manually test different time steps to find the coarsest time grid on which the model still performs well. It’s a pain on the programmer’s side, but we have almost pushed it to the limit in terms of speed. There is also another conceptual difficulty we have to work through: how do we put the HJB component, with adaptive time stepping, together with the forward PDE and the other algebraic equations? As @jlperla said, there must be some mapping, but it’s definitely a non-trivial task, probably involving large changes to our current framework.

I profiled the BlockBandedMatrices benchmark

```julia
julia> using Profile

julia> Profile.clear()

julia> @profile for i in 1:10; A \ RHS; end

julia> Profile.print(format=:flat, C=true, sortedby=:count, mincount=5000)
 Count  Overhead File                                                                                                               Line Function
 =====  ======== ====                                                                                                               ==== ========
  5370         0 @BlockBandedMatrices/src/blockskylineqr.jl                                                                           35 _blockbanded_qr!(::BlockSkylineMatrix{Float64,Array{Float64,1},BlockBandedMatrices.BlockSkylineSizes{Tuple{BlockArrays...
  6448        50 /Users/yingboma/local/Julia-1.4.app/Contents/Resources/julia/lib/julia/libopenblas64_.0.3.5.dylib                     ? dormqr_64_
  6542         5 /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/LinearAlgebra/src/lapack.jl         2777 ormqr!(::Char, ::Char, ::SubArray{Float64,2,BlockSkylineMatrix{Float64,Array{Float64,1},BlockBandedMatrices.BlockSkyli...
  7010         1 @BlockBandedMatrices/src/blockskylineqr.jl                                                                            4 _apply_qr!
  7010         0 @BlockBandedMatrices/src/blockskylineqr.jl                                                                            8 apply_qr!
  7040         0 @BlockBandedMatrices/src/blockskylineqr.jl                                                                           22 _blockbanded_qr!
  7044         0 @BlockBandedMatrices/src/blockskylineqr.jl                                                                           44 qr!(::BlockSkylineMatrix{Float64,Array{Float64,1},BlockBandedMatrices.BlockSkylineSizes{Tuple{BlockArrays.BlockedUnitR...
 11166         0 @BlockBandedMatrices/src/blockskylineqr.jl                                                                           71 _qr(::BlockBandedMatrices.BandedBlockBandedColumns{ArrayLayouts.ColumnMajor}, ::Tuple{Int64,Int64}, ::BandedBlockBande...
 11166         0 @BlockBandedMatrices/src/blockskylineqr.jl                                                                           72 _qr(::BlockBandedMatrices.BandedBlockBandedColumns{ArrayLayouts.ColumnMajor}, ::Tuple{BlockArrays.BlockedUnitRange{Arr...
 11166         0 @ArrayLayouts/src/factorizations.jl                                                                                  47 #qr#13
 11166         0 @ArrayLayouts/src/factorizations.jl                                                                                  47 qr
 11166         0 @BlockBandedMatrices/src/blockskylineqr.jl                                                                           74 _factorize
 11166         0 @ArrayLayouts/src/factorizations.jl                                                                                  52 factorize
 13266         0 @ArrayLayouts/src/ldiv.jl                                                                                            71 _ldiv!
 13266         0 @ArrayLayouts/src/ldiv.jl                                                                                            84 copyto!
 13269         0 @ArrayLayouts/src/ldiv.jl                                                                                            21 copy
 13269         0 @ArrayLayouts/src/ldiv.jl                                                                                            22 materialize
 13270         0 @ArrayLayouts/src/ldiv.jl                                                                                           109 \(::BandedBlockBandedMatrix{Float64,BlockArrays.PseudoBlockArray{Float64,2,Array{Float64,2},Tuple{BlockArrays.BlockedU...
```

It looks like most of the time is spent in qr! (dormqr). @dlfivefifty, I wonder if that is intended; lu! seems like a better option here.
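
A quick check would be something like the following (assuming `A` and `RHS` are the benchmark’s BandedBlockBandedMatrix and right-hand side; the sparse conversion is only there to get a baseline):

```julia
# Rough comparison sketch: the block QR used by `\` vs. a plain sparse LU.
using BlockBandedMatrices, SparseArrays, LinearAlgebra, BenchmarkTools

@btime $A \ $RHS         # current path: block QR (the dormqr calls above)
Asp = sparse(A)          # generic sparse baseline
@btime lu($Asp) \ $RHS   # UMFPACK sparse LU for comparison
```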


I went with QR mostly just for simplicity. QR is a bit nicer to implement since it’s entry-wise the same for a BlockBandedMatrix as for a Matrix, that is, if B isa BlockBandedMatrix we have

```julia
using Test, LinearAlgebra, BlockBandedMatrices

@test qr(B).factors == LinearAlgebra.qrfactUnblocked!(Matrix(B)).factors
@test qr(B).τ == LinearAlgebra.qrfactUnblocked!(Matrix(B)).τ
```

even though the factors are sparse.

LU is a bit more fiddly as the pivoting destroys the sparsity. LAPACK’s banded LU (BandedLU in BandedMatrices.jl) uses a different memory format from standard LU to overcome this, and it would certainly be possible to do the same in BlockBandedMatrices.jl.
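
For reference, a tiny sketch of that banded-LU path in BandedMatrices.jl (the sizes and bandwidths are arbitrary):

```julia
# The banded `\` path: LAPACK's banded LU, which stores extra superdiagonals for pivoting.
using BandedMatrices, LinearAlgebra

B = brand(1_000, 1_000, 2, 3)  # random BandedMatrix with bandwidths (l, u) = (2, 3)
x = B \ rand(1_000)            # dispatches to the banded LU factorization internally
```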

But a higher priority would be a block-banded Cholesky factorization, which would be much faster than either QR or LU and would cover a broad class of PDEs.
