Fastest way to perform A * B * A'

I wouldn’t recommend this: a fast matrix multiply involves much more than SIMD-izing the loops. @turbo (from LoopVectorization.jl), Tullio.jl, and similar tools can’t perform the higher-level transformations (blocking etc.) that are required for good performance with matrices this large, where memory access is the main concern.

See also: LoopVec, Tullio losing to Matrix multiplication

If you need the subblock R[a, b] where a and b are ranges, just do:

Rab = @views A[a,:] * B * A[b,:]'

and then it will use the fast BLAS matrix-multiplication implementation to compute just the subblock that you want.
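For concreteness, here is a small runnable sketch (the sizes and ranges are hypothetical stand-ins) showing that the @views expression matches the corresponding subblock of the full product:

```julia
# Hypothetical small example: full matrices plus index ranges a and b.
A = randn(100, 100)
B = randn(100, 100)
a, b = 1:5, 11:20

# Subblock of R = A * B * A' without forming all of R.
# @views avoids copying the row slices of A.
Rab = @views A[a, :] * B * A[b, :]'

# It agrees with slicing the full product:
R = A * B * A'
@assert Rab ≈ R[a, b]
```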

If the subblock might not be square, you should do the “small” dimension first:

Rab = @views length(a) < length(b) ? (A[a,:] * B) * A[b,:]' : A[a,:] * (B * A[b,:]')
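The reason for ordering by the “small” dimension is the rough multiply count: with A of size n×n and subblock sizes p = length(a), q = length(b), the two parenthesizations cost about pn² + pnq versus n²q + pnq multiply-adds. A sketch of that arithmetic (the function names here are mine, for illustration only):

```julia
# Approximate multiply counts for the two parenthesizations,
# with A n×n, B n×n, p = length(a), q = length(b):
flops_left(n, p, q)  = p*n^2 + p*n*q   # (A[a,:] * B) * A[b,:]'
flops_right(n, p, q) = n^2*q + p*n*q   # A[a,:] * (B * A[b,:]')

# With a small a and a large b, doing A[a,:] * B first is much cheaper:
n, p, q = 1000, 5, 200
@assert flops_left(n, p, q) < flops_right(n, p, q)
```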

You can save additional time by computing only the LU factorization lu(I - X) rather than the whole matrix inverse, and then using it to compute only the desired rows A[a,:] and A[b,:]. Since the triangular solves produce columns, store the rows transposed: the rows of A = inv(I - X) are the columns of inv(I - X)', obtained by solving with the adjoint factorization LU' (you’ll need `using LinearAlgebra` for lu, ldiv!, and I):

using LinearAlgebra

LU = lu(I - X)

Aa = zeros(eltype(LU), size(X,1), length(a))
for i = 1:length(a); Aa[a[i], i] = 1; end # Aa = I[:, a]
ldiv!(LU', Aa) # Aa = inv(I - X)'[:, a] = A[a, :]'

Ab = zeros(eltype(LU), size(X,1), length(b))
for i = 1:length(b); Ab[b[i], i] = 1; end # Ab = I[:, b]
ldiv!(LU', Ab) # Ab = inv(I - X)'[:, b] = A[b, :]'

Rab = length(a) < length(b) ? (Aa' * B) * Ab : Aa' * (B * Ab) # R[a, b]
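Putting it together as a self-contained, runnable sketch (the sizes, X, and ranges are hypothetical stand-ins; note the solves use the adjoint factorization LU' so that we get rows, not columns, of inv(I - X)):

```julia
using LinearAlgebra

X = 0.05 * randn(50, 50)   # hypothetical X; I - X is well-conditioned here
B = randn(50, 50)
a, b = 3:7, 10:30

LU = lu(I - X)

# Rows A[a, :] of A = inv(I - X), stored transposed as columns:
Aa = zeros(size(X, 1), length(a))
for i in 1:length(a); Aa[a[i], i] = 1; end   # Aa = I[:, a]
ldiv!(LU', Aa)                               # Aa = A[a, :]'

Ab = zeros(size(X, 1), length(b))
for i in 1:length(b); Ab[b[i], i] = 1; end   # Ab = I[:, b]
ldiv!(LU', Ab)                               # Ab = A[b, :]'

Rab = length(a) < length(b) ? (Aa' * B) * Ab : Aa' * (B * Ab)

# Check against the (slower) explicit-inverse computation:
A = inv(I - X)
@assert Rab ≈ (A * B * A')[a, b]
```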

(Note that you can do even better if a and b overlap, in which case you can re-use the overlapping portion of the Aa columns in Ab. But the dominant cost will be the lu call if a and b are small, so re-computing the overlap won’t matter.)
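If you did want to exploit the overlap, one hedged sketch (the setup sizes and the names u and Au are mine, for illustration) is to solve for the union of the index sets once and then slice:

```julia
using LinearAlgebra

X = 0.05 * randn(40, 40)       # hypothetical setup
LU = lu(I - X)
a, b = 1:6, 4:12               # overlapping index ranges

u = union(a, b)                # solve for the union of rows once
Au = zeros(size(X, 1), length(u))
for (i, ui) in enumerate(u); Au[ui, i] = 1; end
ldiv!(LU', Au)                 # Au = inv(I - X)'[:, u], i.e. rows A[u, :] transposed

# Slice out the columns for a and b instead of re-solving the shared ones:
Aa = Au[:, [findfirst(==(i), u) for i in a]]   # Aa = A[a, :]'
Ab = Au[:, [findfirst(==(i), u) for i in b]]   # Ab = A[b, :]'
```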

See also: Why re-use factorization(A) rather than inv(A) - #3 by stevengj : 95% of the time, if you are computing an explicit matrix inverse, you are doing something suboptimal.
