Given two dense square matrices with ComplexF64 elements of size approximately 100000X100000, A and B, what is the fastest way to perform R=ABA’, where A’=conj(tr(A)) is the Hermitian conjugate of A?.

If it helps:
(1) A=inv(I-X), where I is the identity matrix and X is another dense matrix.
(2) A and B are not necessarily Hermitian.
(3) Eventually, I need only a certain small sub-block of the result R, for instance, R[10:11,12:13].

I wouldn’t recommend this. A fast matrix multiply is about a lot more than SIMD-izing the loops — @turbo (from LoopVectorization.jl), or using Tullio.jl or similar, won’t be able to do the higher-level transformations (blocking etc.) that are required to get good performance with such large matrices (where the main concern is memory access).

You can save additional time by only computing the LU factorization lu(I - X) rather than the whole matrix inverse, and then use this to compute only the desired subsets of the columns Aa = A[a,:] and Ab = A[b,:]:

LU = lu(I - X)
Aa = zeros(eltype(LU), size(X,1), length(a))
for i = 1:length(a); Aa[a[i], i] = 1; end # Aa = I[:, a]
ldiv!(LU, Aa) # Aa = inv(I - X)[:, a]
Ab = zeros(eltype(LU), size(X,1), length(b))
for i = 1:length(b); Ab[b[i], i] = 1; end # Ab = I[:, b]
ldiv!(LU, Ab) # Ab = inv(I - X)[:, b]
Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab') # R[a, b]

(Note that you can do even better if a and b overlap, in which case you can re-use the overlapping portion of the Aa columns in Ab. But the dominant cost will be the lu call if a and b are small, so re-computing the overlap won’t matter.)

If B does not have special structure, there isn’t a lot special to be done about the multiplications A * B * A' themselves.

Only needing a small subblock of the result means that you should not compute the full product. Instead use R[i,j] == A[i,:] * B * A[j,:]' (where i/j, can be single indices or sets of indices) as suggested above.

Note that these matrices are so large (~160GB) that most systems won’t even be able to hold one in RAM, much less both. Is there further structure you can exploit? Are X and B totally arbitrary matrices or do they possess low rank or some other structure?

Unless you can do something clever about computing A = inv(I-X) itself, that calculation would be at least half your cost anyway (and almost all of it if you only need a tiny block of the result) so there are limited gains from anything else. One possibility is to use the block matrix inverse in order to compute only A[i,:] for the subset i that you need, but I don’t think that actually saves a ton here without further structure.

Alternatively, for some X you may be able to use the Neumann series, which allows that (I - X)^{-1} = \sum_{n=0}^\infty X^n when the sum converges. If it converges, then you should be able to compute this via A_N = \sum_{n=0}^N X^n where \lim_{N\rightarrow\infty} A_N = A. You can compute A_N recursively via A_{n+1} = I + A_{n}X from initial condition A_0 = I until convergence. Since you only care about A[i,:] (the rows of A indexed by i), you can instead start with A_0 = I[i,:] (the rows i of the identity matrix). In other words, the following function should either compute A[i,:] from X or throw an error if it will not work:

using LinearAlgebra
computeAi(X, ::Colon=Colon()) = computeAi(X, axes(X, 1)) # compute full matrix
function computeAi(X, i)
Ii = UniformScaling(one(eltype(X)))[i, axes(X, 2)] # I[i,:]
A = Ii
while true
Aprev = A
A = Ii + Aprev * X # could remove per-loop memory allocations with some effort
all(isfinite, A) || error("does not converge")
A == Aprev && return A # may want to make the comparison approximate
end
end

This is efficient when i is a small subset of the full dimension. When this converges, R[i,j] == computeAi(X, i) * B * computeAi(X, j)'.

Great thanks!
I guess the advantages of LU no longer remain when the subset of indices as given by the arrays a and b scale with the dimensions of the matrices A and B, right? That is, say, if A is nxn, and the indices a and b have lengths 0.1n.

Nope, it’s still better to not compute the inverse.

As I explained in the linked thread, a matrix inverse is computed by first finding the LU factors and then applying ldiv! (or equivalent) to the identity matrix. So if you need < n columns of the n \times n inverse it is better to use the LU factors directly, and for 0.1n columns you should see substantial speedup.

If I understand correctly what you are referring to, that formula only helps with computations involving square diagonal blocks of the inverse, which is not what is needed here. You’re right, you could in principle use this formula to e.g. compute an upper-left diagonal block and the off-diagonal block underneath it (or any other subset of columns via permutation thereof), but I agree that it probably won’t be faster than using the whole inverse even if you only want a small subset of the columns.

But you should be able to save more than a factor of two over computing the full inverse just by using the LU factors directly as I explained above.

That’s effectively just a crude iterative solver — if that works, probably a Krylov method like GMRES will be faster, no? Usually, however, iterative methods are not competitive for dense matrices unless the matrices are very special (e.g. if \Vert X \Vert \ll \Vert I \Vert in this case, or more generally if you have a very good preconditioner / approximate inverse). For example, if you are solving this problem lots of times with only slightly different X, you could exploit that.

using LinearAlgebra
using MKL
N = 4;
a = [1, 3]; b = [2, 4];
X = 0.1 .* rand(Float64, N,N);
B = 1 .* rand(Float64, N,N);
LU = lu(I - X)
Aa = zeros(eltype(LU), length(a), size(X,1))
for i = 1:length(a)
Aa[i, a[i]] = 1
end # Aa = I[:, a]
rdiv!(Aa, LU) # Aa = inv(I - X)[:, a]
Ab = zeros(eltype(LU), length(b), size(X,1))
for i = 1:length(b)
Ab[i, b[i]] = 1;
end # Ab = I[:, b]
rdiv!(Ab, LU) # Ab = inv(I - X)[:, b]
Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab') # R[a, b]

A small correction.

In the initial solution Aa should be inv(I-X)[a,:], but instead it gave inv(I-X)[:,a]. That is, Aa picked up the columns of the inverse, instead of the rows. Same goes for Ab. So the final multiplication in Rab did not go through.