Hi there!
I have located a bottleneck in my simulations that I could use some help with. I think this is mainly a linear algebra problem rather than a pure Julia performance problem, but I know very little about that field, so I'm not even sure about that.
I have to solve the following equation:
$$
\left(\begin{matrix} A & B \\ C & D \end{matrix}\right) \left(\begin{matrix} x \\ y \end{matrix}\right) = \left(\begin{matrix} f \\ g \end{matrix}\right)
$$
The catch here is that $A$ and $D$ are constant (so much so that their inverses can be considered free), while I have $N$ = 1–30 sets of $B$ and $C$, and $M$ = 3–10 sets of $f$ and $g$ for each $(B, C)$. I have $A \approx A^T$, $D \approx D^T$, and $B \approx C^T$, but definitely not exactly symmetric. Typical matrix sizes are 500×500 to 10,000×10,000 for both $A$ and $D$, but I can choose to swap their sizes. All matrices are dense.
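For reference, the block elimination I lean on further down uses the Schur complement $S = D - C A^{-1} B$; solving the system then amounts to

$$
\begin{aligned}
S\,y &= g - C A^{-1} f, \\
x &= A^{-1} \left( f - B y \right),
\end{aligned}
$$

which is exactly what the `block_lu` code below implements.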
Things I have tried so far, with rough example times using size 6500² and 1500² matrices:
- The naive approach: calculate a factorisation of the big matrix; I have to do this only $N$ times (once for each set of $B$ and $C$). 7 s with 1 thread / 2 s with 6 threads per LU.
- Using the amazing Krylov.jl, specifically the `gpmr` method. This is very fast; however, I need to do this $N \times M$ times since I can't reuse old solutions. I tried warm starting, but it didn't improve times. I also get limited benefit from using multiple threads (as in, `@spawn`-ing some threads with a chunk of the work); I might be running up against memory bandwidth here? 0.1 s per solve (with 1 thread). A rough call sketch is below this list.
- Block LU decomposition, where I can use the precalculated $A^{-1}$ (code below). This is decent when using all threads. However, most of the time is spent computing the Schur complement $S$. I thought about using the Woodbury matrix identity, but all that allows is swapping the sizes of $A$ and $D$ (at least if I understood correctly), which I can do anyway. 3 s with 1 thread / 0.6 s with 6 threads per LU, and 0.02 s for a re-solve with different $f$ and $g$.
- Well, I don't actually need to calculate $S$: I tried replacing the $S y = t$ solve with an iterative solve from Krylov.jl, using LinearOperators.jl to avoid forming $S$ explicitly. If I use $D^{-1}$ as a preconditioner, it is roughly as fast as going the `gpmr` route above (which, you know, I shouldn't be that surprised by). 0.1 s per solve. A sketch of this matrix-free version is below the code listing.
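For concreteness, the `gpmr` call looks roughly like this. It's a minimal sketch with hypothetical variable names (`Ainv`, `Dinv`, `Bmat`, `Cmat`); the block-preconditioner keywords `C`/`D` and the scalars `λ`/`μ` are how I remember the Krylov.jl interface, so double-check against the docs:

```julia
using Krylov

# Sketch: Ainv = A⁻¹ and Dinv = D⁻¹ are precomputed once; Bmat/Cmat are the
# current off-diagonal blocks, f/g the current right-hand sides.
# GPMR solves partitioned systems [λI  A; B  μI][x; y] = [b; c]; the general
# diagonal blocks A and D are handled here through the block preconditioners
# (keyword names C and D as I remember them from the docs).
x, y, stats = gpmr(Bmat, Cmat, f, g;
                   C = Ainv,        # preconditioner for the first block row
                   D = Dinv,        # preconditioner for the second block row
                   λ = 1.0, μ = 1.0)
```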
```julia
# These are the naive versions -- I use ones with storage preallocated.
using LinearAlgebra

function block_lu(A⁻¹, B, C, D, f, g)
    L = C * A⁻¹
    t = g - L * f
    S = D - L * B          # Schur complement S = D - C*A⁻¹*B
    Slu = lu!(S)
    y = Slu \ t
    x = A⁻¹ * (f - B * y)
    x, y, L, Slu
end

function re_block_lu(L, Slu, A⁻¹, B, f, g)  # same (B, C) but different (f, g)
    t = g - L * f
    y = Slu \ t
    x = A⁻¹ * (f - B * y)
    x, y
end
```
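And this is roughly what I mean by the matrix-free Schur variant. It's a minimal sketch rather than my exact code: the 5-argument `prod!` signature for `LinearOperator`, passing `Dinv` directly as the left preconditioner `M`, and the choice of `gmres` as the Krylov solver are assumptions/stand-ins, so check the package docs:

```julia
using LinearAlgebra, Krylov, LinearOperators

# Sketch: Ainv = A⁻¹ and Dinv = D⁻¹ precomputed; B, C, D the current blocks;
# f, g the current right-hand sides.
n = size(D, 1)

# Matrix-free Schur complement S = D - C*A⁻¹*B applied via three mat-vecs.
# (Assumes the 5-argument product res .= α*(S*v) + β*res that recent
# LinearOperators.jl versions expect.)
function schur_prod!(res, v, α, β)
    Sv = D * v .- C * (Ainv * (B * v))
    if iszero(β)
        res .= α .* Sv
    else
        res .= α .* Sv .+ β .* res
    end
    return res
end

S_op = LinearOperator(Float64, n, n, false, false, schur_prod!)

t = g .- C * (Ainv * f)              # right-hand side of S*y = g - C*A⁻¹*f
y, stats = gmres(S_op, t; M = Dinv)  # D⁻¹ as left preconditioner
x = Ainv * (f .- B * y)              # back-substitution for x
```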
I feel like there might be some performance left on the table because (1) with the iterative method I'm barely using multithreading at all, and (2) with the block LU my bottleneck is now a matrix multiplication (as opposed to a linear solve) while not using the available $D^{-1}$.
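For reference, the thread chunking I mentioned for the iterative route is along these lines; `solve_one` is a hypothetical stand-in for a single `gpmr` (or Schur) solve:

```julia
# Rough sketch: chunk the M (f, g) pairs for one (B, C) across threads.
# `solve_one(f, g)` is a hypothetical stand-in for one iterative solve.
function solve_all(fs, gs)
    chunks = Iterators.partition(eachindex(fs), cld(length(fs), Threads.nthreads()))
    tasks = map(chunks) do idxs
        Threads.@spawn [solve_one(fs[i], gs[i]) for i in idxs]
    end
    reduce(vcat, fetch.(tasks))
end
```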
It might very well be optimised as far as possible already, but hopefully someone has a clever idea in store.