Block matrix linear solve performance

Hi there!

I have located a bottleneck in my simulations that I could use some help with. I think this is mainly a linear algebra problem rather than a pure Julia performance one, but I know very little about that field, so I’m not even sure about that :sweat_smile:

I have to solve the following equation:
\left(\begin{matrix} A & B \\ C & D \end{matrix}\right) \left(\begin{matrix} x \\ y \end{matrix}\right) = \left(\begin{matrix} f \\ g \end{matrix}\right)
The catch here is that A and D are constant (so much so that their inverses can be considered free), while I have N = 1-30 sets of B and C and M = 3-10 sets of f and g for each (B, C). I have A \approx A^T, D \approx D^T, and B \approx C^T, but definitely not exactly symmetric. Typical matrix sizes are 500x500 to 10,000x10,000 for both A and D, but I can choose to swap their sizes. All matrices are dense.
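For reference, the block elimination that the block-LU approach below builds on is the standard Schur-complement one:

x = A^{-1} \left( f - B y \right), \qquad \left( D - C A^{-1} B \right) y = g - C A^{-1} f

so one solve with the big matrix comes down to a solve with the Schur complement S = D - C A^{-1} B plus a few products with A^{-1}.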

Things I have tried so far, with rough example times using size 6500² and 1500² matrices:

  • The naive approach: calculate a factorisation of the big matrix; I have to do this only N times (once for each set of B and C). 7s/2s with 1/6 threads per LU.
  • Using the amazing Krylov.jl, specifically the gpmr method (rough sketch at the end of this post). This is very fast; however, I need to do it N×M times since I can’t reuse old solutions. I tried warm starting but it didn’t improve times. I also get limited benefit from using multiple threads (as in, @spawn-ing some threads with a chunk of the work each); I might be running up against memory bandwidth here? 0.1s per solve (with 1 thread).
  • Block LU decomposition where I can use the precalculated A^{-1} (code below). This is decent when using all threads. However, most time is spent computing the Schur complement S. I thought about using the Woodbury matrix identity but all that allows is swapping sizes of A and D (at least if I understood correctly), which I can do anyway. 3s/0.6s with 1/6 threads per LU and 0.02s for a re-solve with different f and g.
  • Well, I don’t actually need to calculate S: I tried replacing the solve S y = t with an iterative solve from Krylov.jl, using LinearOperators.jl to avoid forming S explicitly (sketch just below the code). With D^{-1} as a preconditioner it is roughly as fast as going the gpmr route above (which, you know, I shouldn’t be that surprised by). 0.1s per solve.
# These are the naive versions -- I use ones with storage preallocated.
using LinearAlgebra

function block_lu(A⁻¹, B, C, D, f, g)
    L = C * A⁻¹            # lower block factor C*A⁻¹
    t = g - L * f          # reduced right-hand side
    S = D - L * B          # Schur complement S = D - C*A⁻¹*B
    Slu = lu!(S)
    y = Slu \ t
    x = A⁻¹ * (f - B * y)
    x, y, L, Slu
end

function re_block_lu(L, Slu, A⁻¹, B, f, g)  # same (B, C) but different (f, g)
    t = g - L * f
    y = Slu \ t
    x = A⁻¹ * (f - B * y)
    x, y
end
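The S-free variant from the last bullet looks roughly like the following. This is a sketch rather than my exact code (the function name is just a placeholder): it assumes LinearOperators.jl's lazy operator algebra for S and Krylov.jl's gmres with D⁻¹ passed as the left preconditioner M.

using LinearAlgebra, Krylov, LinearOperators

# Solve (D - C*A⁻¹*B) y = g - C*A⁻¹*f without ever forming the Schur complement,
# with D⁻¹ acting as a left preconditioner for GMRES.
function block_solve_matfree(A⁻¹, D⁻¹, B, C, D, f, g)
    opS = LinearOperator(D) - LinearOperator(C) * LinearOperator(A⁻¹) * LinearOperator(B)
    t = g - C * (A⁻¹ * f)
    y, stats = gmres(opS, t; M = LinearOperator(D⁻¹))
    x = A⁻¹ * (f - B * y)
    x, y, stats
end

The operator algebra means each GMRES iteration costs a handful of matrix-vector products instead of the dense matrix-matrix products needed to form S.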

I feel like there might be some performance left on the table because 1) with the iterative method I’m barely using multithreading at all, and 2) with the block LU my bottleneck is now a matrix multiplication (as opposed to a linear solve) while not using the available D^{-1}.
It might very well be optimised as far as possible already, but hopefully someone has a clever idea in store. :grinning_face:
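For completeness, the gpmr route is roughly the sketch below (again not my exact code, and the names are placeholders). It assumes gpmr's default λ = μ = 1 block form, with the system first left-preconditioned by blockdiag(A^{-1}, D^{-1}) so the diagonal blocks become identities, and LinearOperators.jl for the lazy A^{-1}B and D^{-1}C products.

using Krylov, LinearOperators

# Left-preconditioning [A B; C D][x; y] = [f; g] with blockdiag(A⁻¹, D⁻¹) gives
# [I  A⁻¹B; D⁻¹C  I][x; y] = [A⁻¹f; D⁻¹g], which matches GPMR's partitioned form
# with its default λ = μ = 1.
function block_solve_gpmr(A⁻¹, D⁻¹, B, C, f, g)
    opB = LinearOperator(A⁻¹) * LinearOperator(B)   # lazy A⁻¹B
    opC = LinearOperator(D⁻¹) * LinearOperator(C)   # lazy D⁻¹C
    x, y, stats = gpmr(opB, opC, A⁻¹ * f, D⁻¹ * g)
    x, y, stats
end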


This one is fast, but it’s just a PR of an unregistered package so it needs some revival for normal usage.


Thanks for your reply!
I’m not sure how I’d apply this here, since neither my matrices nor my blocks are diagonal. Is there some way I should bring my matrix into this shape?

For each of your B and C you have a matrix [A B; C D], so you can:

  • Build a block-diagonal matrix where each block is one of the [A B_i; C_i D], then aggregate the [f_i, g_i] and use the PR method.
  • The second way is just to build the full sparse matrix (the block-diagonal matrix explained before) and fill it where you need, but I think it will be less good, especially the assembly part (rough sketch at the end of this reply).
  • The third way is to change the math to actually build a global system instead of a lot of little ones; not always possible, of course.

All of this is true only if each block is fully independent of the solves of the other blocks; otherwise the method you already have stays the best, I think. That is, it breaks if f_{i+1} and g_{i+1} depend on x_i, y_i, the solution of [A B_i; C_i D] [x_i; y_i] = [f_i; g_i].
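For the second point, a quick sketch with plain SparseArrays, with one aggregated right-hand side per block to keep it short (the function and variable names are just placeholders):

using SparseArrays, LinearAlgebra

# Assemble one big block-diagonal sparse matrix with blocks [A B_i; C_i D],
# stack the right-hand sides, and factorise everything in a single sparse LU.
function solve_all_blocks(A, D, Bs, Cs, fs, gs)
    blocks = [sparse([A Bs[i]; Cs[i] D]) for i in eachindex(Bs)]
    K = blockdiag(blocks...)
    rhs = reduce(vcat, [vcat(fs[i], gs[i]) for i in eachindex(fs)])
    lu(K) \ rhs   # stacked solution [x_1; y_1; x_2; y_2; ...]
end

As said, converting the big dense blocks to sparse like this is exactly the part I expect to be slow.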

The blocks are indeed independent of each other. A and D are the same in each block, and there are M sets of (f, g) for each block, but nothing depends on any other block.

I appreciate your ideas; I tried responding to them in the same order:

  • I think I could build such a block-diagonal matrix like Chris mentioned, but if I understand the code here correctly, it just does the LU of each individual block (and later the ldiv with each block), so I’m not sure what I would gain by doing that versus doing the LU of each block straight away?
  • I will give it a shot, but since I will be building quite a big system while not really taking advantage of the constant A and D or their inverses, I’m unsure why it would be beneficial to do so?
  • Hm, interesting idea. Could you explain what such a global system might look like and why it could be faster? I always thought smaller systems would be preferred because of the nonlinear scaling with size, but I’m afraid I’m way out of my depth here.