Block matrix linear solve performance

Hi there!

I have located a bottleneck in my simulations that I could use some help with. I think this is mainly a linear algebra problem rather than a pure Julia performance problem, but I know very little about that field, so I'm not even sure about that :sweat_smile:

I have to solve the following equation:
\left(\begin{matrix} A & B \\ C & D \end{matrix}\right) \left(\begin{matrix} x \\ y \end{matrix}\right) = \left(\begin{matrix} f \\ g \end{matrix}\right)
The catch here is that A and D are constant (so much so that their inverses can be considered free), while I have N = 1-30 sets of B and C and M = 3-10 sets of f and g for each (B, C). I have A \approx A^T, D \approx D^T, and B \approx C^T, but definitely not exactly symmetric. Typical matrix sizes are 500×500 to 10,000×10,000 for both A and D, but I can choose to swap their sizes. All matrices are dense.
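For concreteness, one such system with made-up data (using the example sizes from the timings below) looks like this:

```julia
using LinearAlgebra

# Made-up data just to illustrate the structure; in my case A, D (and their
# inverses) are fixed, while the B, C pairs and the f, g right-hand sides vary.
n, m = 6500, 1500
A, D = rand(n, n), rand(m, m)   # constant; A⁻¹ and D⁻¹ essentially free
B, C = rand(n, m), rand(m, n)   # N = 1-30 of these pairs
f, g = rand(n), rand(m)         # M = 3-10 of these per (B, C)

K = [A B; C D]                  # dense (n + m) × (n + m) block matrix
z = K \ [f; g]
x, y = z[1:n], z[n+1:end]
```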

Things I have tried so far, with rough example times using size 6500² and 1500² matrices:

  • The naive approach: calculate a factorisation of the big matrix; I have to do this only N times (once for each set of B and C). 7s/2s with 1/6 threads per LU.
  • Using the amazing Krylov.jl, specifically the gpmr method (a rough sketch of such a call is below, after the code). This is very fast; however, I need to do this N×M times since I can't reuse old solutions. I tried warm starting but it didn't improve times. I also get limited benefit from using multiple threads (as in, @spawn-ing some threads, each with a chunk of the work); I might be running up against memory bandwidth here? 0.1s per solve (with 1 thread).
  • Block LU decomposition where I can use the precalculated A^{-1} (code below). This is decent when using all threads. However, most time is spent computing the Schur complement S. I thought about using the Woodbury matrix identity but all that allows is swapping sizes of A and D (at least if I understood correctly), which I can do anyway. 3s/0.6s with 1/6 threads per LU and 0.02s for a re-solve with different f and g.
  • Well, I don't actually need to calculate S; I tried replacing the solve S y = t with a linear solve from Krylov.jl, using LinearOperators.jl to avoid forming S explicitly (sketched after the code below). If I use D^{-1} as a preconditioner, this is roughly as fast as going the gpmr route above (which, you know, I shouldn't be that surprised by). 0.1s per solve.
```julia
using LinearAlgebra  # for lu!

# These are the naive versions -- I use ones with storage preallocated.
function block_lu(A⁻¹, B, C, D, f, g)
    L = C * A⁻¹
    t = g - L * f
    S = D - L * B  # Schur complement: S = D - C*A⁻¹*B
    Slu = lu!(S)
    y = Slu \ t
    x = A⁻¹ * (f - B * y)
    x, y, L, Slu
end

function re_block_lu(L, Slu, A⁻¹, B, f, g)  # same (B, C) but different (f, g)
    t = g - L * f
    y = Slu \ t
    x = A⁻¹ * (f - B * y)
    x, y
end
```
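For completeness, here is a rough sketch of how such a gpmr call can be set up with the precomputed inverses, following the Krylov.jl docstring on block-diagonal preconditioning (a sketch of the idea, not my exact code):

```julia
using Krylov

# gpmr solves [λM A; B μN] [x; y] = [b; c] when its preconditioner keywords
# satisfy CE = M⁻¹ and DF = N⁻¹. Note: the keywords C and D below are gpmr's
# preconditioner slots, NOT my C and D blocks; with E = F = I (the defaults)
# this targets exactly [A B; C D] [x; y] = [f; g].
x, y, stats = gpmr(B, C, f, g; C = A⁻¹, D = D⁻¹, λ = 1.0, μ = 1.0)
```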
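And the Schur-complement-free variant from the last bullet looks roughly like this (a sketch; the function name is made up and gmres is just my illustrative choice of solver):

```julia
using Krylov, LinearOperators

# S = D - C*A⁻¹*B is never formed; each Krylov iteration only needs S*v,
# i.e. three dense mat-vecs instead of a big matrix-matrix product.
function schur_krylov(A⁻¹, B, C, D, D⁻¹, f, g)
    opS = LinearOperator(D) - LinearOperator(C) * LinearOperator(A⁻¹) * LinearOperator(B)
    t = g - C * (A⁻¹ * f)
    y, _ = gmres(opS, t; M = LinearOperator(D⁻¹))  # D⁻¹ as left preconditioner
    x = A⁻¹ * (f - B * y)
    x, y
end
```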

I feel like there might be some performance left on the table because 1) with the iterative method I’m barely using multithreading at all, and 2) with the block LU my bottleneck is now a matrix multiplication (as opposed to a linear solve) while not using the available D^{-1}.
It might very well be optimised as far as possible already, but hopefully someone has a clever idea in store. :grinning_face:


This one is fast, but it’s just a PR of an unregistered package so it needs some revival for normal usage.


Thanks for your reply!
I'm not sure how I'd apply this here, since neither my matrices nor my blocks are diagonal. Is there some way I should bring my matrix into this shape?

For each of your B and C you have a matrix [A B;C D] so you can

  • build a block diagonal matrix where each block is some [A B_i; C_i D]? Then aggregate the [f_i, g_i] and use the PR method.
  • The second way is just to build the full sparse matrix (the block-diagonal matrix as explained before) and fill in the values where you need them, but I think it will be less good, especially the assembly part. A rough sketch of this assembly is at the end of this reply.
  • The third way is to change the math to actually build a global system instead of a lot of little ones; not always possible, of course.

All this is true only if each block is fully independent of the solves of the other blocks; otherwise the method you already use stays the best, I think, i.e. if f_(i+1) and g_(i+1) depend on x_i, y_i, the solution of [A B_i; C_i D] [x_i, y_i] = [f_i, g_i].
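For the second bullet, the assembly could look something like this untested sketch (Bs, Cs, fs, gs are placeholder names for your data, with one (f, g) per block for simplicity):

```julia
using SparseArrays

# Untested sketch: stack the independent systems [A Bᵢ; Cᵢ D] into one
# block-diagonal sparse matrix and solve them in a single call.
blocks = [sparse([A Bᵢ; Cᵢ D]) for (Bᵢ, Cᵢ) in zip(Bs, Cs)]
K   = blockdiag(blocks...)
rhs = reduce(vcat, [[fᵢ; gᵢ] for (fᵢ, gᵢ) in zip(fs, gs)])
sol = K \ rhs
```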

The blocks are indeed independent of each other. A and D are the same in each block, and there are M sets of (f, g) for each block, but nothing depends on any other block.

I appreciate your ideas; I tried responding to them in the same order:

  • I think I could build such a block-diagonal matrix as Chris mentioned, but if I understand the code here correctly, it just does the LU of each individual block (and later the ldiv with each block), so I'm not sure what I would gain by doing that versus doing the LU of each block straight away?
  • I will give it a shot, but since I would be building quite a big system while not really taking advantage of the constant A and D or their inverses, I'm unsure why it would be beneficial to do so.
  • Hm, interesting idea. Could you explain what such a global system might look like and why it could be faster? I always thought smaller systems would be preferred because of the nonlinear scaling with size – but I'm afraid I'm way out of my depth here.

You could try block_gmres, using blkdiag(A⁻¹, D⁻¹) as a right preconditioner.
You can solve all variants of (f, g) for a given set of (B, C) with a right-hand side that is a matrix.
You no longer exploit the 2×2 structure of the system like gpmr, but you benefit from more BLAS2/BLAS3 operations due to the block right-hand side.
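An untested sketch of what I mean, where fs and gs stand for the M right-hand sides of one (B, C):

```julia
using Krylov, LinearOperators

# All M right-hand sides for one (B, C) are solved at once; blkdiag(A⁻¹, D⁻¹)
# is applied as the right preconditioner N.
K   = [A B; C D]
N   = BlockDiagonalOperator(A⁻¹, D⁻¹)
RHS = [reduce(hcat, fs); reduce(hcat, gs)]   # (n + m) × M block right-hand side
X, stats = block_gmres(K, RHS; N = N)
xs = X[1:size(A, 1), :]
ys = X[size(A, 1)+1:end, :]
```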

A common performance bottleneck in Julia for Krylov solvers is the sparse matrix-vector product, which is not multithreaded. MKLSparse.jl provides a significant speed-up in practice.

My last piece of advice would be to switch to a sequential BLAS via libblastrampoline (LBT) or to set the number of OpenBLAS threads to one, and launch asynchronous instances of gpmr to solve different combinations of parameters (B, C, f, g) on the CPU.
If you have an NVIDIA or AMD GPU, that's even better: you can run the same thing as a batch of gpmr calls.
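For the CPU variant, a rough sketch, where problems and solve_one are placeholders for your list of (B, C, f, g) combinations and for the gpmr-based solve of one combination:

```julia
using LinearAlgebra

BLAS.set_num_threads(1)  # sequential BLAS so the Julia threads don't oversubscribe

# `problems` and `solve_one` are placeholders (see the note above).
results = Vector{Any}(undef, length(problems))
Threads.@threads for i in eachindex(problems)
    B, C, f, g = problems[i]
    results[i] = solve_one(B, C, f, g)   # e.g. one gpmr call per combination
end
```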

@Karajan May I ask you a few details about your application, or a reference?
I’m always curious to know where our new Krylov solvers are relevant.

cc @dpo


Apologies for the delay; I seem to have messed up my notifications.

Thank you for all the helpful pointers! I tried to go through them all:

First the block_gmres variant. I’m not sure what exactly the blkdiag(A⁻¹, D⁻¹) is; I tried using N = BlockDiagonal([A⁻¹, D⁻¹]) (from BlockDiagonals.jl) but this was much, much slower than not providing N at all.
N = blockdiag(spdiagm(diag(A⁻¹)), spdiagm(diag(D⁻¹))) worked as well and was faster than no N. But even with this, gpmr will usually be faster for me, since the right-hand side has too few columns to make it worth it, according to my timings.

Speaking of this point, is there an option to make LinearOperator also work on matrices instead of vectors? If I have such skinny matrices, I imagine I’d get similar benefits as if I had a vector.

My matrices are dense so I don't think the sparse accelerations will help in my case, but it might come in handy some other time.

I do have BLAS threads set to 1 already (and block_gmres is only competitive if I increase the BLAS threads for that one), and a second thread only improves times by a factor of ~1.4, while any more threads don't make an impact. I speculated in my initial post that I might be running up against memory bandwidth, but I haven't done any tests to confirm this.

I do have an NVIDIA GPU available (although not on all PCs this will be running on), and after a bit of back and forth I have some promising first results. What do you mean by batching gpmr calls?
The fastest method I found (so far) was to use the block variant from my initial post and, instead of doing any of the matrix multiplications or the factorization (since they were slow in comparison), use LinearOperators and cgs with M = D⁻¹ to solve for y.
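Roughly, that variant looks like the following (an untested, simplified sketch of my GPU version; the d-prefixed names are just placeholders for the device copies):

```julia
using CUDA, Krylov, LinearOperators

# Move the dense blocks to the GPU once, then solve S y = t matrix-free with
# cgs, preconditioned by D⁻¹; S = D - C*A⁻¹*B is never formed.
dA⁻¹, dB, dC, dD, dD⁻¹ = CuArray.((A⁻¹, B, C, D, D⁻¹))
df, dg = CuArray(f), CuArray(g)

opS = LinearOperator(dD) - LinearOperator(dC) * LinearOperator(dA⁻¹) * LinearOperator(dB)
t = dg - dC * (dA⁻¹ * df)
y, stats = cgs(opS, t; M = LinearOperator(dD⁻¹))
x = dA⁻¹ * (df - dB * y)
```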

The application is an integral formulation of an inductive simulation. The A and D blocks are the interactions of two pieces with themselves (which is why they stay the same), while B and C change depending on their relative position etc., for each of which I have to calculate a few things. This is how I arrived at the problem, and gpmr seems to fit like a glove!

PS: Sorry for the deletion/edit; I managed to hit the “reply” button when like 25% done, which I figured would be more confusing than anything :sweat_smile: