I have a hot loop that takes the following simplified form:

```
for n in eachindex(result)
    compute_A!(A)
    compute_b!(b)
    x = A \ b
    result[n] = f(x)
end
```

Here, `A` is a tall dense matrix, `b` is a dense vector, both of fixed size, and `f(x)` is a number.

Is there any way to pre-allocate some sort of cache that helps me avoid memory allocations in the linear solve, or, more generally, are there other optimisations that might be applicable?

Thank you for your assistance!


The time to run that is probably going to be completely dominated by the QR factorization; the cost of the memory allocation is likely not noticeable next to it.

```
julia> @btime $A \ $b;
110.990 ms (611 allocations: 77.16 MiB)

julia> @btime similar($A);
4.797 μs (2 allocations: 76.29 MiB)
```

Are your `A`s related to each other? If so, you can probably speed this up a bunch by reusing the factorization in some way.
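For example, in the hypothetical best case where `A` is fixed and only `b` changes between iterations, you can factor once outside the loop and reuse the factorization (a sketch with made-up sizes and stand-in functions):

```julia
using LinearAlgebra, Random

A = randn(1000, 50)          # tall dense matrix (sizes are made up)
F = qr(A)                    # factor once, outside the hot loop
b = zeros(1000)
result = zeros(100)

for n in eachindex(result)
    rand!(b)                 # stand-in for compute_b!(b)
    x = F \ b                # least-squares solve reusing the factorization
    result[n] = sum(abs2, x) # stand-in for f(x)
end
```

Each iteration then pays only for the (cheap) triangular solve instead of a fresh QR factorization.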


Unless there is a connection between each update, as @Oscar_Smith suggested, the only thing you can optimize is the actual solve.

- Use the fastest solver. If you are on `x64` and the arrays are dense, try using `MKL.jl`. On Apple Silicon, `AppleAccelerate.jl` can work wonders as well.
- Choose the optimal decomposition for your matrix by doing the decomposition steps manually. For inspiration, have a look at MATLAB's `mldivide()`.
- Reduce allocations of the decomposition and solution by using `FastLapackInterface.jl`.

It won’t reduce much compared to the actual factorization and solving, but it is a starting point.

Even if the structural connection of \boldsymbol{A} between iterations cannot be exploited for an iterative decomposition, if the solution of the previous iteration is similar to the next one, then you may consider an iterative method with a warm start.

Another idea, **in case your system is guaranteed to have a good condition number**, is to take advantage of the small number of columns relative to the number of rows and solve the normal equations: \boldsymbol{C} \boldsymbol{x} = \boldsymbol{d}, where \boldsymbol{C} = \boldsymbol{A}^{T} \boldsymbol{A} and \boldsymbol{d} = \boldsymbol{A}^{T} \boldsymbol{b}.

Now the system is square, with as many rows as \boldsymbol{A} has columns (e.g. 100 rows, hence 100^2 entries, for a matrix with 100 columns).

You will be able to use a Cholesky decomposition (with pivoting and no check; see Cholesky Decomposition of Low Rank Positive Semidefinite Matrix), since \boldsymbol{C} is at least SPSD, and that is about as fast as you can get. Points 1 and 3 from above still apply (point 2 is covered by Cholesky).

**It will square your condition number**, so use it wisely.
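A minimal sketch of the normal-equations approach with preallocated buffers (dimensions are made up; for simplicity this uses a plain unpivoted Cholesky, whereas the post above suggests the pivoted, no-check variant for the semidefinite case):

```julia
using LinearAlgebra

m, n = 1000, 50              # made-up dimensions: tall A
A = randn(m, n)
b = randn(m)

# Preallocate once, outside the hot loop.
C = zeros(n, n)
d = zeros(n)

mul!(C, A', A)               # C = AᵀA, written into the preallocated buffer
mul!(d, A', b)               # d = Aᵀb
F = cholesky!(Symmetric(C))  # in-place Cholesky; assumes C is SPD
ldiv!(F, d)                  # overwrite d with the solution x = C⁻¹d
x = d
```

All the large allocations now happen once up front; each loop iteration only refills `C` and `d` and redoes the (cheap, n×n) factorization.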


Solving the normal equations actually gave a huge speedup (5x to 10x) and much smaller allocations. Thank you for the tip! I’ll take a look at the other suggestions as well.
