Fastest way to perform A[a,:] * B * A’[:,b] where a and b are vectors of indices

eveningsilverfox · August 8, 2024, 4:09pm

This question is a follow-up to

To recap, given two dense square matrices with ComplexF64 elements of size approximately 100000X100000, A and B, what is the fastest way to perform R=ABA’, where A’=conj(tr(A)) is the Hermitian conjugate of A?.

If it helps:
(1) A=inv(I-X), where I is the identity matrix and X is another dense matrix.
(2) A and B are not necessarily Hermitian.
(3) I need only a certain sub-block of the result R, for instance, R[aa,bb], where aa,bb are vectors of indices. In the linked problem, the solution (atleast in my system, with the attached code below) looks good for very small number of elements in aa,bb. But for larger aa,bb, the LU-based inverse seems to perform significantly worse. How to go about rectifying this?

In my laptop (using julia -t8, since I have 8 cores)

N=1000, m=500

588.018 ms (13 allocations: 28.62 MiB)
33.331 ms (15 allocations: 30.33 MiB)

N=1000, m=10

17.920 ms (12 allocations: 15.50 MiB)
23.468 ms (14 allocations: 17.21 MiB)

The first line corresponds to the LU based inverse, while the second corresponds to the naive inverse. In my system, for anything over m=15 (1.5% of matrix dimension), the LU based inverse is slower. For larger matrices (N=10000…), I find that this threshold, where the benefits of the LU based inverse cease to exist, gets lowered.

I have appended the code.

using LinearAlgebra
using MKL
using BenchmarkTools

N = 1000;
m = 500;

aa = zeros(Int32, m); bb = zeros(Int32, m);
for ij = 1:m
    aa[ij] = rand(1:N);
    bb[ij] = rand(1:N);
end
    
X = 0.1 .* rand(Float64, N,N);
B = 1 .* rand(Float64, N,N);

function LUinv(X,B,a,b)
    LU = lu(I - X)

    Aa = zeros(eltype(LU), length(a), size(X,1))
    for i = 1:length(a)
        Aa[i, a[i]] = 1
    end # Aa = I[:, a]
    rdiv!(Aa, LU) # Aa = inv(I - X)[:, a]

    Ab = zeros(eltype(LU), length(b), size(X,1))
    for i = 1:length(b)
        Ab[i, b[i]] = 1;
    end # Ab = I[:, b]
    rdiv!(Ab, LU) # Ab = inv(I - X)[:, b]

    Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab'); # R[a, b]

    return Rab
end

function basicinv(X,B,a,b)
    G = inv(I-X);
    Ga = G';

    Rab = G[a,:]*B*Ga[:,b];

    return Rab
end

@btime LUinv(X,B,aa,bb)

@btime basicinv(X,B,aa,bb)

stevengj · August 8, 2024, 6:00pm

Something is suboptimal here, because the “regular inverse” is also LU based — it starts with the LU factorization, and then does a solve.

(For one thing, in the case where a and b overlap, you are computing the corresponding rows twice in your two rdiv! calls. Ideally you should take the union of a and b and then do a single rdiv! call.)

For example, on my machine, this has a speed almost indistinguishable from inv(A) for A = rand(1000,1000):

function inv2(A)
    LU = lu(A)
    return ldiv!(LU, Matrix{eltype(LU)}(I, size(A)...))
end

but using rdiv! ~~is actually slightly faster~~ much slower (sorry, I misread the units) than inv(A) for some reason:

function inv3(A)
    LU = lu(A)
    return rdiv!(Matrix{eltype(LU)}(I, size(A)...), LU)
end

via:

julia> A = rand(1000,1000); @btime inv($A); @btime inv2($A); @btime inv3($A);
  10.791 ms (5 allocations: 8.13 MiB)
  10.796 ms (5 allocations: 15.27 MiB)
  912.368 ms (5 allocations: 15.27 MiB)

julia> inv(A) ≈ inv2(A) ≈ inv3(A)
true

So it seems like there is a performance bug in the rdiv! implementation: rdiv! on LU object is much slower than ldiv! · Issue #55422 · JuliaLang/julia · GitHub

Workaround: for now, compute lu(I - X') and use ldiv!?

eveningsilverfox · August 8, 2024, 6:16pm

I should have made it clear in the OP. In my case the indices a and b are explicitly non overlapping, hence I didn’t bother to factor that in.

In your case, rdiv! looks 90 times slower!, all times are in same units (ms). I get something similar on my laptop, something’s up with rdiv! (last one). First one is inv, and second one is ldiv!

24.354 ms (5 allocations: 9.35 MiB)
22.194 ms (5 allocations: 15.27 MiB)
603.143 ms (5 allocations: 15.27 MiB)

eveningsilverfox · August 8, 2024, 8:20pm

I see.
So here is the modification with just ldiv!. Hoping that I am not doing something stupid here, it still doesn’t help.

using LinearAlgebra
using MKL
using BenchmarkTools

N = 1000;
m = 200;

aa = zeros(Int32, m); bb = zeros(Int32, m);
for ij = 1:m
    aa[ij] = rand(1:N);
    bb[ij] = rand(1:N);
end
    
X = 0.1 .* rand(Float64, N,N);
B = 1 .* rand(Float64, N,N);

function LUinvslow(X,B,a,b)
    LU = lu(I - X);

    Aa = zeros(eltype(LU), length(a), size(X,1));
    for i = 1:length(a)
        Aa[i, a[i]] = 1;
    end 
    rdiv!(Aa, LU)

    Ab = zeros(eltype(LU), length(b), size(X,1));
    for i = 1:length(b)
        Ab[i, b[i]] = 1;
    end 
    rdiv!(Ab, LU)

    Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab');

    return Rab
end

function LUinv(X,B,a,b)
    LU = lu(I - X');

    Aa = zeros(eltype(LU), length(a), size(X,1));
    for i = 1:length(a)
        Aa[i, a[i]] = 1;
    end
    ldiv!(LU, Aa')';

    Ab = zeros(eltype(LU), length(b), size(X,1));
    for i = 1:length(b)
        Ab[i, b[i]] = 1;
    end
    ldiv!(LU, Ab')';

    Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab');

    return Rab
end

function basicinv(X,B,a,b)
    G = inv(I-X);
    Ga = G';

    Rab = G[a,:]*B*Ga[:,b];

    return Rab
end

LUinv(X,B,aa,bb) ≈ basicinv(X,B,aa,bb)

@btime LUinvslow($X,$B,$aa,$bb);
@btime LUinv($X,$B,$aa,$bb);
@btime basicinv($X,$B,$aa,$bb);

The ldiv!(second) code here seems to run even slower than the rdiv!(first) one.

true
244.594 ms (14 allocations: 20.15 MiB)
320.256 ms (13 allocations: 20.15 MiB)
30.657 ms (15 allocations: 21.86 MiB)

stevengj · August 8, 2024, 8:27pm

When you construct Aa, construct it transposed so that you don’t have to pass Aa' to ldiv! — I suspect that you are hitting a slow generic fallback (not LAPACK).

Then, at the end, multiply Aa' * B (there are fast BLAS calls for multiplying by transposed matrices).

eveningsilverfox · August 8, 2024, 8:37pm

Great, this worked, keeping the transpose outside the ldiv! helped, and it’s faster than the basic inverse version

using LinearAlgebra
using MKL
using BenchmarkTools

N = 1000;
m = 200;

aa = zeros(Int32, m); bb = zeros(Int32, m);
for ij = 1:m
    aa[ij] = rand(1:N);
    bb[ij] = rand(1:N);
end
    
X = 0.1 .* rand(Float64, N,N);
B = 1 .* rand(Float64, N,N);

function LUinvslow(X,B,a,b)
    LU = lu(I - X);

    Aa = zeros(eltype(LU), length(a), size(X,1));
    for i = 1:length(a)
        Aa[i, a[i]] = 1;
    end 
    rdiv!(Aa, LU);

    Ab = zeros(eltype(LU), length(b), size(X,1));
    for i = 1:length(b)
        Ab[i, b[i]] = 1;
    end 
    rdiv!(Ab, LU);

    Rab = length(a) < length(b) ? (Aa * B) * Ab' : Aa * (B * Ab');

    return Rab
end

function LUinv(X,B,a,b)
    LU = lu(I - X');

    Aa = zeros(eltype(LU), size(X,1), length(a));
    for i = 1:length(a)
        Aa[a[i], i] = 1;
    end
    ldiv!(LU, Aa);

    Ab = zeros(eltype(LU), size(X,1), length(b));
    for i = 1:length(b)
        Ab[b[i], i] = 1;
    end
    ldiv!(LU, Ab);

    Rab = length(a) < length(b) ? (Aa' * B) * Ab : Aa' * (B * Ab);

    return Rab
end

function basicinv(X,B,a,b)
    G = inv(I-X);
    Ga = G';

    Rab = G[a,:]*B*Ga[:,b];

    return Rab
end

LUinv(X,B,aa,bb) ≈ basicinv(X,B,aa,bb)

@btime LUinvslow($X,$B,$aa,$bb);
@btime LUinv($X,$B,$aa,$bb);
@btime basicinv($X,$B,$aa,$bb);

true
234.954 ms (13 allocations: 20.15 MiB)
20.052 ms (13 allocations: 20.15 MiB)
27.951 ms (15 allocations: 21.86 MiB)

Topic		Replies	Views
Fastest way to perform A * B * A’ General Usage question , blas , mkl , linearalgebra , optimization	9	585	August 13, 2024
Fast `diag(A' * B * A)` General Usage performance , linearalgebra	22	874	May 4, 2023
Non-allocating A / B where A and B are matrices General Usage	8	1744	March 7, 2019
Performance inv & solve in Julia & R Performance	15	1429	August 11, 2020
Most efficient way of computing B'* inv(A)* B for A large Specific Domains linearalgebra	30	816	September 30, 2020

Fastest way to perform A[a,:] * B * A’[:,b] where a and b are vectors of indices

Related topics