I wouldn’t recommend this. A fast matrix multiply is about a lot more than SIMD-izing the loops: @turbo (from LoopVectorization.jl), Tullio.jl, and similar tools won’t do the higher-level transformations (blocking, etc.) that are required to get good performance with such large matrices, where the main concern is memory access.
See also: LoopVec, Tullio losing to Matrix multiplication
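For a sense of why, here is a minimal benchmark sketch one could run; the `@turbo` kernel is the standard three-loop gemm pattern from the LoopVectorization.jl examples, and the size `n = 2000` and the use of BenchmarkTools are illustrative assumptions:

```julia
using LinearAlgebra, BenchmarkTools, LoopVectorization

# straightforward SIMD-ized triple loop: no cache blocking, no multithreading
function gemm_turbo!(C, A, B)
    @turbo for i in axes(A, 1), j in axes(B, 2)
        Cij = zero(eltype(C))
        for k in axes(A, 2)
            Cij += A[i, k] * B[k, j]
        end
        C[i, j] = Cij
    end
    return C
end

n = 2000
A, B, C = rand(n, n), rand(n, n), zeros(n, n)
@btime gemm_turbo!($C, $A, $B)   # expect this to lose at this size (memory-bound)
@btime mul!($C, $A, $B)          # BLAS: blocked and multithreaded
```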
If you need the subblock `R[a, b]`, where `a` and `b` are ranges, just do:

```julia
Rab = @views A[a,:] * B * A[b,:]'
```
and then it will use the fast BLAS matrix-multiplication implementation to compute just the subblock that you want.
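For concreteness, here is a small self-contained check of that one-liner; the construction of `A` from `X`, the sizes, and the index ranges are illustrative assumptions:

```julia
using LinearAlgebra

n = 500
X = rand(n, n) / n              # scaled so that I - X is comfortably invertible
B = randn(n, n)
A = inv(I - X)                  # fine for a small test; see the lu() approach below
a, b = 1:20, 100:160

Rab = @views A[a,:] * B * A[b,:]'
@assert Rab ≈ (A * B * A')[a, b]
```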
If the subblock might not be square, then you should do the “small” dimension first, so that the expensive multiplication by the n×n matrix `B` involves the smaller of the two index sets:

```julia
Rab = @views length(a) < length(b) ? (A[a,:] * B) * A[b,:]' : A[a,:] * (B * A[b,:]')
```
You can save additional time by computing only the LU factorization `lu(I - X)` rather than the whole matrix inverse, and then using it to compute just the desired rows `Aa = A[a,:]` and `Ab = A[b,:]` of `A = inv(I - X)`:
```julia
LU = lu(I - X)

Aa = zeros(eltype(LU), size(X,1), length(a))
for i = 1:length(a); Aa[a[i], i] = 1; end   # Aa = I[:, a]
Aa = ldiv!(LU', Aa)'                        # Aa = inv(I - X)[a, :] (transposed solve gives rows)

Ab = zeros(eltype(LU), size(X,1), length(b))
for i = 1:length(b); Ab[b[i], i] = 1; end   # Ab = I[:, b]
Ab = ldiv!(LU', Ab)'                        # Ab = inv(I - X)[b, :]

Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab') # R[a, b]
```
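If you need this in several places, it could be wrapped in a small helper; the function name `R_subblock` and the test sizes below are illustrative, not from the original code:

```julia
using LinearAlgebra

# R[a, b] where R = A * B * A' and A = inv(I - X), without ever forming A
function R_subblock(X, B, a, b)
    LU = lu(I - X)
    rows(r) = begin
        E = zeros(eltype(LU), size(X, 1), length(r))
        for (i, j) in enumerate(r)
            E[j, i] = 1                  # E = I[:, r]
        end
        ldiv!(LU', E)'                   # ((I - X)' \ I[:, r])' = inv(I - X)[r, :]
    end
    Aa, Ab = rows(a), rows(b)
    return length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab')
end

n = 500
X = rand(n, n) / n
B = randn(n, n)
Rab = R_subblock(X, B, 1:20, 100:160)
```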
(Note that you can do even better if `a` and `b` overlap, in which case you can re-use the overlapping rows of `Aa` in `Ab`; a sketch follows below. But the dominant cost will be the `lu` call if `a` and `b` are small, so re-computing the overlap won’t matter.)
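For what it’s worth, one way to realize that re-use is to solve once for the union of the two index sets and then slice, continuing with `LU`, `X`, `a`, and `b` from the snippet above; purely a sketch, and (as noted) usually not worth the bookkeeping:

```julia
ab = sort!(unique!(vcat(collect(a), collect(b))))
Eab = zeros(eltype(LU), size(X, 1), length(ab))
for (i, j) in enumerate(ab); Eab[j, i] = 1; end   # Eab = I[:, ab]
Aab = ldiv!(LU', Eab)'                            # rows ab of inv(I - X)
Aa = Aab[[findfirst(==(j), ab) for j in a], :]
Ab = Aab[[findfirst(==(j), ab) for j in b], :]
```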
See also: Why re-use factorization(A) rather than inv(A) - #3 by stevengj: 95% of the time, if you are computing an explicit matrix inverse, you are doing something suboptimal.