I have a hot loop that takes the following simplified form:

```
for n in eachindex(result)
    compute_A!(A)
    compute_b!(b)
    x = A \ b
    result[n] = f(x)
end
```

Here, `A` is a tall dense matrix, `b` is a dense vector, both of fixed size, and `f(x)` is a number.

Is there any way to pre-allocate some sort of cache that helps me avoid memory allocations in the linear solve, or, more generally, are there other optimisations that might be applicable?

Thank you for your assistance!


The time to run that is probably going to be completely dominated by the QR factorization; the cost of the memory allocation is likely not noticeable next to it.

```
julia> @btime $A \ $b;
110.990 ms (611 allocations: 77.16 MiB)

julia> @btime similar($A);
4.797 μs (2 allocations: 76.29 MiB)
```

Are your `A`s related to each other? If so, you can probably speed this up a bunch by reusing the factorization in some way.
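For example, in the hypothetical best case where `A` is fixed and only `b` changes between iterations, you can factor once outside the loop and reuse the factorization (a sketch with made-up sizes and stand-in functions):

```julia
using LinearAlgebra, Random

A = randn(1000, 50)          # tall dense matrix (sizes are made up)
F = qr(A)                    # factor once, outside the hot loop
b = zeros(1000)
result = zeros(100)

for n in eachindex(result)
    rand!(b)                 # stand-in for compute_b!(b)
    x = F \ b                # least-squares solve reusing the factorization
    result[n] = sum(abs2, x) # stand-in for f(x)
end
```

Each iteration then pays only for the (cheap) triangular solve instead of a fresh QR factorization.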


Unless there is a connection between each update, as @Oscar_Smith suggested, the only thing you can optimize is the actual solve.

- Use the fastest solver. If you are on `x64` and the arrays are dense, try using `MKL.jl`. On Apple Silicon, `AppleAccelerate.jl` can work wonders as well.
- Choose the optimal decomposition for your matrix by doing the decomposition steps manually. For inspiration, have a look at MATLAB's `mldivide()`.
- Reduce allocations of the decomposition and solution by using `FastLapackInterface.jl`.

It won’t reduce much compared to the actual factorization and solving, but it is a starting point.

Even if the structural connection of \boldsymbol{A} between iterations cannot be exploited for an iterative decomposition, if the solution of the previous iteration is similar to the next one, then you may consider an iterative method with a warm start.

Another idea, **in case your system is guaranteed to have a good condition number**, is to take advantage of the small number of columns relative to the number of rows and solve the normal equations: \boldsymbol{C} \boldsymbol{x} = \boldsymbol{d}, where \boldsymbol{C} = \boldsymbol{A}^{T} \boldsymbol{A} and \boldsymbol{d} = \boldsymbol{A}^{T} \boldsymbol{b}.

Now the system is square, with as many rows as \boldsymbol{A} has columns (e.g. 100 rows, hence 100^2 entries, for a matrix with 100 columns).

You will be able to use a Cholesky decomposition (with pivoting and no check; see Cholesky Decomposition of Low Rank Positive Semidefinite Matrix), since \boldsymbol{C} is at least SPSD, and that is about as fast as you can get. Points 1 and 3 from above still apply (point 2 is covered by Cholesky).

**It will square your condition number**, so use it wisely.
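A minimal sketch of the normal-equations approach with preallocated buffers (dimensions are made up; for simplicity this uses a plain unpivoted Cholesky, whereas the post above suggests the pivoted, no-check variant for the semidefinite case):

```julia
using LinearAlgebra

m, n = 1000, 50              # made-up dimensions: tall A
A = randn(m, n)
b = randn(m)

# Preallocate once, outside the hot loop.
C = zeros(n, n)
d = zeros(n)

mul!(C, A', A)               # C = AᵀA, written into the preallocated buffer
mul!(d, A', b)               # d = Aᵀb
F = cholesky!(Symmetric(C))  # in-place Cholesky; assumes C is SPD
ldiv!(F, d)                  # overwrite d with the solution x = C⁻¹d
x = d
```

All the large allocations now happen once up front; each loop iteration only refills `C` and `d` and redoes the (cheap, n×n) factorization.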


Solving the normal equations actually gave a huge speedup (5x to 10x) and much smaller allocations. Thank you for the tip! I’ll take a look at the other suggestions as well.
