# Fastest way to perform A[a,:] * B * A’[:,b] where a and b are vectors of indices

This question is a follow-up to

To recap, given two dense square matrices with ComplexF64 elements of size approximately 100000X100000, A and B, what is the fastest way to perform R=ABA’, where A’=conj(tr(A)) is the Hermitian conjugate of A?.

If it helps:
(1) A=inv(I-X), where I is the identity matrix and X is another dense matrix.
(2) A and B are not necessarily Hermitian.
(3) I need only a certain sub-block of the result R, for instance, R[aa,bb], where aa,bb are vectors of indices. In the linked problem, the solution (atleast in my system, with the attached code below) looks good for very small number of elements in aa,bb. But for larger aa,bb, the LU-based inverse seems to perform significantly worse. How to go about rectifying this?

In my laptop (using julia -t8, since I have 8 cores)

1. N=1000, m=500
``````588.018 ms (13 allocations: 28.62 MiB)
33.331 ms (15 allocations: 30.33 MiB)
``````
1. N=1000, m=10
``````17.920 ms (12 allocations: 15.50 MiB)
23.468 ms (14 allocations: 17.21 MiB)
``````

The first line corresponds to the LU based inverse, while the second corresponds to the naive inverse. In my system, for anything over m=15 (1.5% of matrix dimension), the LU based inverse is slower. For larger matrices (N=10000…), I find that this threshold, where the benefits of the LU based inverse cease to exist, gets lowered.

I have appended the code.

``````using LinearAlgebra
using MKL
using BenchmarkTools

N = 1000;
m = 500;

aa = zeros(Int32, m); bb = zeros(Int32, m);
for ij = 1:m
aa[ij] = rand(1:N);
bb[ij] = rand(1:N);
end

X = 0.1 .* rand(Float64, N,N);
B = 1 .* rand(Float64, N,N);

function LUinv(X,B,a,b)
LU = lu(I - X)

Aa = zeros(eltype(LU), length(a), size(X,1))
for i = 1:length(a)
Aa[i, a[i]] = 1
end # Aa = I[:, a]
rdiv!(Aa, LU) # Aa = inv(I - X)[:, a]

Ab = zeros(eltype(LU), length(b), size(X,1))
for i = 1:length(b)
Ab[i, b[i]] = 1;
end # Ab = I[:, b]
rdiv!(Ab, LU) # Ab = inv(I - X)[:, b]

Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab'); # R[a, b]

return Rab
end

function basicinv(X,B,a,b)
G = inv(I-X);
Ga = G';

Rab = G[a,:]*B*Ga[:,b];

return Rab
end

@btime LUinv(X,B,aa,bb)

@btime basicinv(X,B,aa,bb)
``````

Something is suboptimal here, because the “regular inverse” is also LU based — it starts with the LU factorization, and then does a solve.

(For one thing, in the case where `a` and `b` overlap, you are computing the corresponding rows twice in your two `rdiv!` calls. Ideally you should take the union of `a` and `b` and then do a single `rdiv!` call.)

For example, on my machine, this has a speed almost indistinguishable from `inv(A)` for `A = rand(1000,1000)`:

``````function inv2(A)
LU = lu(A)
return ldiv!(LU, Matrix{eltype(LU)}(I, size(A)...))
end
``````

but using `rdiv!` is actually slightly faster much slower (sorry, I misread the units) than `inv(A)` for some reason:

``````function inv3(A)
LU = lu(A)
return rdiv!(Matrix{eltype(LU)}(I, size(A)...), LU)
end
``````

via:

``````julia> A = rand(1000,1000); @btime inv(\$A); @btime inv2(\$A); @btime inv3(\$A);
10.791 ms (5 allocations: 8.13 MiB)
10.796 ms (5 allocations: 15.27 MiB)
912.368 ms (5 allocations: 15.27 MiB)

julia> inv(A) ≈ inv2(A) ≈ inv3(A)
true
``````

So it seems like there is a performance bug in the `rdiv!` implementation: rdiv! on LU object is much slower than ldiv! · Issue #55422 · JuliaLang/julia · GitHub

Workaround: for now, compute `lu(I - X')` and use `ldiv!`?

2 Likes

I should have made it clear in the OP. In my case the indices a and b are explicitly non overlapping, hence I didn’t bother to factor that in.

In your case, rdiv! looks 90 times slower!, all times are in same units (ms). I get something similar on my laptop, something’s up with rdiv! (last one). First one is inv, and second one is ldiv!

``````24.354 ms (5 allocations: 9.35 MiB)
22.194 ms (5 allocations: 15.27 MiB)
603.143 ms (5 allocations: 15.27 MiB)
``````

I see.
So here is the modification with just ldiv!. Hoping that I am not doing something stupid here, it still doesn’t help.

``````using LinearAlgebra
using MKL
using BenchmarkTools

N = 1000;
m = 200;

aa = zeros(Int32, m); bb = zeros(Int32, m);
for ij = 1:m
aa[ij] = rand(1:N);
bb[ij] = rand(1:N);
end

X = 0.1 .* rand(Float64, N,N);
B = 1 .* rand(Float64, N,N);

function LUinvslow(X,B,a,b)
LU = lu(I - X);

Aa = zeros(eltype(LU), length(a), size(X,1));
for i = 1:length(a)
Aa[i, a[i]] = 1;
end
rdiv!(Aa, LU)

Ab = zeros(eltype(LU), length(b), size(X,1));
for i = 1:length(b)
Ab[i, b[i]] = 1;
end
rdiv!(Ab, LU)

Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab');

return Rab
end

function LUinv(X,B,a,b)
LU = lu(I - X');

Aa = zeros(eltype(LU), length(a), size(X,1));
for i = 1:length(a)
Aa[i, a[i]] = 1;
end
ldiv!(LU, Aa')';

Ab = zeros(eltype(LU), length(b), size(X,1));
for i = 1:length(b)
Ab[i, b[i]] = 1;
end
ldiv!(LU, Ab')';

Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab');

return Rab
end

function basicinv(X,B,a,b)
G = inv(I-X);
Ga = G';

Rab = G[a,:]*B*Ga[:,b];

return Rab
end

LUinv(X,B,aa,bb) ≈ basicinv(X,B,aa,bb)

@btime LUinvslow(\$X,\$B,\$aa,\$bb);
@btime LUinv(\$X,\$B,\$aa,\$bb);
@btime basicinv(\$X,\$B,\$aa,\$bb);

``````

The ldiv!(second) code here seems to run even slower than the rdiv!(first) one.

``````true
244.594 ms (14 allocations: 20.15 MiB)
320.256 ms (13 allocations: 20.15 MiB)
30.657 ms (15 allocations: 21.86 MiB)
``````

When you construct `Aa`, construct it transposed so that you don’t have to pass `Aa'` to `ldiv!` — I suspect that you are hitting a slow generic fallback (not LAPACK).

Then, at the end, multiply `Aa' * B` (there are fast BLAS calls for multiplying by transposed matrices).

1 Like

Great, this worked, keeping the transpose outside the ldiv! helped, and it’s faster than the basic inverse version

``````using LinearAlgebra
using MKL
using BenchmarkTools

N = 1000;
m = 200;

aa = zeros(Int32, m); bb = zeros(Int32, m);
for ij = 1:m
aa[ij] = rand(1:N);
bb[ij] = rand(1:N);
end

X = 0.1 .* rand(Float64, N,N);
B = 1 .* rand(Float64, N,N);

function LUinvslow(X,B,a,b)
LU = lu(I - X);

Aa = zeros(eltype(LU), length(a), size(X,1));
for i = 1:length(a)
Aa[i, a[i]] = 1;
end
rdiv!(Aa, LU);

Ab = zeros(eltype(LU), length(b), size(X,1));
for i = 1:length(b)
Ab[i, b[i]] = 1;
end
rdiv!(Ab, LU);

Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab');

return Rab
end

function LUinv(X,B,a,b)
LU = lu(I - X');

Aa = zeros(eltype(LU), size(X,1), length(a));
for i = 1:length(a)
Aa[a[i], i] = 1;
end
ldiv!(LU, Aa);

Ab = zeros(eltype(LU), size(X,1), length(b));
for i = 1:length(b)
Ab[b[i], i] = 1;
end
ldiv!(LU, Ab);

Rab = length(a) < length(b) ? (Aa' * B) * Ab : Aa' * (B * Ab);

return Rab
end

function basicinv(X,B,a,b)
G = inv(I-X);
Ga = G';

Rab = G[a,:]*B*Ga[:,b];

return Rab
end

LUinv(X,B,aa,bb) ≈ basicinv(X,B,aa,bb)

@btime LUinvslow(\$X,\$B,\$aa,\$bb);
@btime LUinv(\$X,\$B,\$aa,\$bb);
@btime basicinv(\$X,\$B,\$aa,\$bb);
``````
``````true
234.954 ms (13 allocations: 20.15 MiB)
20.052 ms (13 allocations: 20.15 MiB)
27.951 ms (15 allocations: 21.86 MiB)
``````
1 Like