Fastest way to perform A[a,:] * B * A’[:,b] where a and b are vectors of indices

This question is a follow-up to

To recap, given two dense square matrices A and B with ComplexF64 elements, each of size approximately 100000×100000, what is the fastest way to compute R = A*B*A’, where A’ = conj(transpose(A)) is the Hermitian conjugate of A?

If it helps:
(1) A=inv(I-X), where I is the identity matrix and X is another dense matrix.
(2) A and B are not necessarily Hermitian.
(3) I need only a certain sub-block of the result R, for instance R[aa,bb], where aa and bb are vectors of indices. In the linked problem, the solution (at least on my system, with the code attached below) looks good for a very small number of elements in aa and bb. But for larger aa and bb, the LU-based inverse performs significantly worse. How can I rectify this?

On my laptop (using julia -t8, since I have 8 cores):

  1. N=1000, m=500
588.018 ms (13 allocations: 28.62 MiB)
33.331 ms (15 allocations: 30.33 MiB)
  2. N=1000, m=10
17.920 ms (12 allocations: 15.50 MiB)
23.468 ms (14 allocations: 17.21 MiB)

The first line of each pair corresponds to the LU-based inverse, while the second corresponds to the naive inverse. On my system, for anything over m=15 (1.5% of the matrix dimension), the LU-based inverse is slower. For larger matrices (N=10000, …), I find that this threshold, beyond which the benefits of the LU-based inverse cease to exist, gets lower.

I have appended the code.

using LinearAlgebra
using MKL
using BenchmarkTools

N = 1000;
m = 500;

aa = zeros(Int32, m); bb = zeros(Int32, m);
for ij = 1:m
    aa[ij] = rand(1:N);
    bb[ij] = rand(1:N);
end
    
X = 0.1 .* rand(Float64, N,N);
B = 1 .* rand(Float64, N,N);

function LUinv(X,B,a,b)
    LU = lu(I - X)

    Aa = zeros(eltype(LU), length(a), size(X,1))
    for i = 1:length(a)
        Aa[i, a[i]] = 1
    end # Aa = I[a, :]
    rdiv!(Aa, LU) # Aa = inv(I - X)[a, :]

    Ab = zeros(eltype(LU), length(b), size(X,1))
    for i = 1:length(b)
        Ab[i, b[i]] = 1;
    end # Ab = I[b, :]
    rdiv!(Ab, LU) # Ab = inv(I - X)[b, :]

    Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab'); # R[a, b]

    return Rab
end

function basicinv(X,B,a,b)
    G = inv(I-X);
    Ga = G';

    Rab = G[a,:]*B*Ga[:,b];

    return Rab
end

@btime LUinv(X,B,aa,bb)

@btime basicinv(X,B,aa,bb)

Something is suboptimal here, because the “regular inverse” is also LU based — it starts with the LU factorization, and then does a solve.

(For one thing, in the case where a and b overlap, you are computing the corresponding rows twice in your two rdiv! calls. Ideally you should take the union of a and b and then do a single rdiv! call.)
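Here is a minimal sketch of that "single solve over the union" idea; the function name LUinv_union and the index bookkeeping are mine, not from the thread (and note that the rdiv! performance issue discussed below applies here as well):

using LinearAlgebra

function LUinv_union(X, B, a, b)
    LU = lu(I - X)
    u = union(a, b)                                 # unique row indices needed for a or b
    Au = zeros(eltype(LU), length(u), size(X, 1))
    for (i, ui) in enumerate(u)
        Au[i, ui] = 1                               # Au = I[u, :]
    end
    rdiv!(Au, LU)                                   # Au = inv(I - X)[u, :]
    pos = Dict(ui => i for (i, ui) in enumerate(u)) # map original index -> row of Au
    Ga = Au[[pos[x] for x in a], :]                 # inv(I - X)[a, :]
    Gb = Au[[pos[x] for x in b], :]                 # inv(I - X)[b, :]
    return Ga * B * Gb'                             # R[a, b]
end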

For example, on my machine, this has a speed almost indistinguishable from inv(A) for A = rand(1000,1000):

function inv2(A)
    LU = lu(A)
    return ldiv!(LU, Matrix{eltype(LU)}(I, size(A)...))
end

but using rdiv! is actually much slower (not slightly faster, as I first wrote; sorry, I misread the units) than inv(A) for some reason:

function inv3(A)
    LU = lu(A)
    return rdiv!(Matrix{eltype(LU)}(I, size(A)...), LU)
end

via:

julia> A = rand(1000,1000); @btime inv($A); @btime inv2($A); @btime inv3($A);
  10.791 ms (5 allocations: 8.13 MiB)
  10.796 ms (5 allocations: 15.27 MiB)
  912.368 ms (5 allocations: 15.27 MiB)

julia> inv(A) ≈ inv2(A) ≈ inv3(A)
true

So it seems like there is a performance bug in the rdiv! implementation: rdiv! on LU object is much slower than ldiv! · Issue #55422 · JuliaLang/julia · GitHub

Workaround: for now, compute lu(I - X') and use ldiv!?
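For reference, the identity behind this workaround is that inv(I - X)[a, :] equals ((I - X') \ I[:, a])', so a left solve against lu(I - X') followed by an adjoint gives the rows we want. A quick check (the variable names here are just for illustration):

using LinearAlgebra

N = 5; X = 0.1 .* rand(N, N); a = [2, 4]
Ea = Matrix{Float64}(I, N, N)[:, a]   # the columns of the identity selected by a
rows_a = (lu(I - X') \ Ea)'           # left solve against lu(I - X'), then adjoint
rows_a ≈ inv(I - X)[a, :]             # true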


I should have made it clear in the OP: in my case the indices a and b are explicitly non-overlapping, hence I didn’t bother to factor that in.

In your case, rdiv! looks about 90 times slower! All times are in the same units (ms). I get something similar on my laptop, so something’s up with rdiv!. In the output below, the first line is inv, the second is ldiv!, and the last is rdiv!:

24.354 ms (5 allocations: 9.35 MiB)
22.194 ms (5 allocations: 15.27 MiB)
603.143 ms (5 allocations: 15.27 MiB)

I see.
So here is the modification with just ldiv!. I hope I am not doing something stupid here, but it still doesn’t help:

using LinearAlgebra
using MKL
using BenchmarkTools

N = 1000;
m = 200;

aa = zeros(Int32, m); bb = zeros(Int32, m);
for ij = 1:m
    aa[ij] = rand(1:N);
    bb[ij] = rand(1:N);
end
    
X = 0.1 .* rand(Float64, N,N);
B = 1 .* rand(Float64, N,N);

function LUinvslow(X,B,a,b)
    LU = lu(I - X);

    Aa = zeros(eltype(LU), length(a), size(X,1));
    for i = 1:length(a)
        Aa[i, a[i]] = 1;
    end 
    rdiv!(Aa, LU)

    Ab = zeros(eltype(LU), length(b), size(X,1));
    for i = 1:length(b)
        Ab[i, b[i]] = 1;
    end 
    rdiv!(Ab, LU)

    Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab');

    return Rab
end

function LUinv(X,B,a,b)
    LU = lu(I - X');

    Aa = zeros(eltype(LU), length(a), size(X,1));
    for i = 1:length(a)
        Aa[i, a[i]] = 1;
    end
    ldiv!(LU, Aa')';

    Ab = zeros(eltype(LU), length(b), size(X,1));
    for i = 1:length(b)
        Ab[i, b[i]] = 1;
    end
    ldiv!(LU, Ab')';

    Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab');

    return Rab
end

function basicinv(X,B,a,b)
    G = inv(I-X);
    Ga = G';

    Rab = G[a,:]*B*Ga[:,b];

    return Rab
end

LUinv(X,B,aa,bb) ≈ basicinv(X,B,aa,bb)

@btime LUinvslow($X,$B,$aa,$bb);
@btime LUinv($X,$B,$aa,$bb);
@btime basicinv($X,$B,$aa,$bb);

The ldiv! (second) code here seems to run even slower than the rdiv! (first) one:

true
244.594 ms (14 allocations: 20.15 MiB)
320.256 ms (13 allocations: 20.15 MiB)
30.657 ms (15 allocations: 21.86 MiB)

When you construct Aa, construct it transposed so that you don’t have to pass Aa' to ldiv! — I suspect that you are hitting a slow generic fallback (not LAPACK).

Then, at the end, multiply Aa' * B (there are fast BLAS calls for multiplying by transposed matrices).
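For what it’s worth, the adjoint really is a lazy wrapper rather than a plain strided Matrix, which is consistent with the fallback suspicion above (a quick check of my own, not from the thread):

using LinearAlgebra

Aa = zeros(3, 5)
typeof(Aa')            # LinearAlgebra.Adjoint{Float64, Matrix{Float64}}, a lazy wrapper
Aa' isa StridedMatrix  # false; the fast LAPACK-backed solve methods require a strided matrix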


Great, this worked. Keeping the transpose outside the ldiv! helped, and it’s now faster than the basic-inverse version:

using LinearAlgebra
using MKL
using BenchmarkTools

N = 1000;
m = 200;

aa = zeros(Int32, m); bb = zeros(Int32, m);
for ij = 1:m
    aa[ij] = rand(1:N);
    bb[ij] = rand(1:N);
end
    
X = 0.1 .* rand(Float64, N,N);
B = 1 .* rand(Float64, N,N);

function LUinvslow(X,B,a,b)
    LU = lu(I - X);

    Aa = zeros(eltype(LU), length(a), size(X,1));
    for i = 1:length(a)
        Aa[i, a[i]] = 1;
    end 
    rdiv!(Aa, LU);

    Ab = zeros(eltype(LU), length(b), size(X,1));
    for i = 1:length(b)
        Ab[i, b[i]] = 1;
    end 
    rdiv!(Ab, LU);

    Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab');

    return Rab
end

function LUinv(X,B,a,b)
    LU = lu(I - X');

    Aa = zeros(eltype(LU), size(X,1), length(a));
    for i = 1:length(a)
        Aa[a[i], i] = 1;
    end
    ldiv!(LU, Aa);

    Ab = zeros(eltype(LU), size(X,1), length(b));
    for i = 1:length(b)
        Ab[b[i], i] = 1;
    end
    ldiv!(LU, Ab);

    Rab = length(a) < length(b) ? (Aa' * B) * Ab : Aa' * (B * Ab);

    return Rab
end

function basicinv(X,B,a,b)
    G = inv(I-X);
    Ga = G';

    Rab = G[a,:]*B*Ga[:,b];

    return Rab
end

LUinv(X,B,aa,bb) ≈ basicinv(X,B,aa,bb)

@btime LUinvslow($X,$B,$aa,$bb);
@btime LUinv($X,$B,$aa,$bb);
@btime basicinv($X,$B,$aa,$bb);

true
234.954 ms (13 allocations: 20.15 MiB)
20.052 ms (13 allocations: 20.15 MiB)
27.951 ms (15 allocations: 21.86 MiB)