Fast `diag(A' * B * A)`

Let A be a k × n matrix and B a k × k matrix, with k ≪ n.

I want to compute the diagonal

diag(A' * B * A)

in a faster way. Here A' * B * A is a large n × n matrix, which I would like to avoid instantiating.

Is there a faster (BLAS?) way to do this operation?

Here is a possibility:

"""
    diagprod(A, B)

Compute `diag(A' * B * A)` in an efficient way.
"""
function diagprod(A::AbstractMatrix, B::AbstractMatrix)
    @assert size(A, 1) == size(B, 1) == size(B, 2)
    return vec(sum(A .* (B * A); dims=1))
end

Can we make it faster?
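For reference, here is a quick sanity check of `diagprod` against the naive version (real-valued inputs; the sizes are arbitrary, and the definition is repeated so the snippet runs standalone):

```julia
using LinearAlgebra  # for diag

function diagprod(A::AbstractMatrix, B::AbstractMatrix)
    @assert size(A, 1) == size(B, 1) == size(B, 2)
    return vec(sum(A .* (B * A); dims=1))
end

A = randn(5, 50)   # k = 5, n = 50
B = randn(5, 5)
@assert diagprod(A, B) ≈ diag(A' * B * A)
```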


Wouldn’t a loop be faster? Or maybe Tullio can provide a happy medium?

I think a loop with LoopVectorization would blow this out of the water by removing the need for allocations


Your diagprod is algorithmically about as fast as one could expect and uses BLAS for the multiplication. I wouldn’t expect you to do much better. The largest remaining opportunities likely lie in reducing allocations.

You could pass in workspace variables to avoid the allocation for B * A, but that’s a bit of a nuisance and likely won’t save a ton. But you can do the A .* BA calculation in-place to re-use an existing allocation:

BA = B * A
BA .= conj.(A) .* BA
return vec(sum(BA; dims=1))

(note that I added conj to the left A since it’s actually A').
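Putting both ideas together, a fully preallocated version might look like this (the names `diagprod!` and `BA` are just my sketch, not a fixed API):

```julia
using LinearAlgebra  # for mul! and diag

# Sketch: the caller preallocates D (length n) and BA (same size as A).
function diagprod!(D::AbstractVector, BA::AbstractMatrix,
                   A::AbstractMatrix, B::AbstractMatrix)
    mul!(BA, B, A)              # BA = B * A via BLAS, no allocation
    BA .= conj.(A) .* BA        # reuse the buffer for the elementwise product
    sum!(reshape(D, 1, :), BA)  # column sums written into D
    return D
end

A = randn(4, 7); B = randn(4, 4)
D = zeros(7); BA = similar(A)
@assert diagprod!(D, BA, A, B) ≈ diag(A' * B * A)
```

The `conj` makes this correct for complex inputs as well.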

Further, you can avoid a memory pass through the data by simply using

map(LinearAlgebra.dot, eachcol(A), eachcol(B * A))

Something with LoopVectorization or Tullio could be a little faster, I expect. However, if you don’t use BLAS for the B * A calculation you’ll likely lose performance at large matrix sizes and the remaining sum(conj.(A) .* BA; dims=1) piece only has so much room for improvement.

If you’re dealing with small matrices, there’s much more room for non-BLAS solutions to shine.

EDIT: at least for some input sizes, posters below have demonstrated speedups notably bigger than what I might have anticipated by using LoopVectorization and related packages.


I assume you meant that you wanted A to be k × n? Here are a few options:

#+begin_src julia
using LinearAlgebra, BenchmarkTools, Tullio, LoopVectorization

f1(A, B) = diag(A' * B * A)
f2(A, B) = vec(sum(A .* (B * A); dims=1))
f3(A, B) = @tullio D[i] := A[j, i] * B[j, k] * A[k, i]

function f4(A::Matrix{T}, B::Matrix{U}) where {T, U}
    V = promote_type(T, U)
    @assert size(A, 1) == size(B, 1) == size(B, 2)
    D = zeros(V, size(A, 2))
    @tturbo for i ∈ eachindex(D)
        for j ∈ axes(B, 1)
            for k ∈ axes(B, 2)
                D[i] += conj(A[j, i]) * B[j, k] * A[k, i]
            end
        end
    end
    D
end
f5(A, B) = map(LinearAlgebra.dot, eachcol(A), eachcol(B * A))

function f6(A::Matrix{T}, B::Matrix{U}) where {T, U}
    V = promote_type(T, U)
    @assert size(A, 1) == size(B, 1) == size(B, 2)
    D = Vector{V}(undef, size(A, 2))
    @tturbo for i ∈ eachindex(D)
        di = zero(eltype(D))
        for j ∈ axes(B, 1)
            for k ∈ axes(B, 2)
                di += conj(A[j, i]) * B[j, k] * A[k, i]
            end
        end
        D[i] = di
    end
    D
end

let n = 1000, k = 10
    A = randn(k, n)
    B = randn(k, k)
    for f ∈ (f1, f2, f3, f4, f5, f6)
        print(f, "   ")
        @btime $f($A, $B)
    end
end;


#+end_src
#+RESULTS:
: f1     599.030 μs (5 allocations: 7.71 MiB)
: f2     21.550 μs (7 allocations: 164.36 KiB)
: f3     20.880 μs (1 allocation: 7.94 KiB)
: f4     4.636 μs (1 allocation: 7.94 KiB)
: f5     23.940 μs (3 allocations: 86.11 KiB)
: f6     4.481 μs (1 allocation: 7.94 KiB)

Looks like writing the manual loop in LoopVectorization.jl is the winner here.


Edit1: I missed one of the suggestions above, so I added it as f5.
Edit2: I messed up the order of indices for f4 so it was faster than it should have been. However, even with this fix it’s still the fastest option, just by a lesser margin.
Edit3: Fixed the problem pointed out here: Fast `diag(A' * B * A)` - #9 by mikmoore. This again has a negative performance impact on f4, but it’s still the fastest.
Edit4: I found yet another problem with f3 and f4 where I accidentally wrote B[j, k] * A[j, i] instead of B[j, k] * A[k, i]. Fixing this slows down f3 and f4 a bit, but f3 is hit harder than f4 and f4 remains the fastest.
Edit5: added another LoopVectorization.jl example that reduces the number of array accesses required, which speeds things up a little.


Shouldn’t this be size(A,2)? This is doing 100x less work than it should be.

Oops, yes, it should be size(A, 2). And it's not doing 100x more work; it's doing 100x less work than it should, because the original poster had the wrong order of indices for A (I think). A should be k × n, not n × k, if it's meant to be multiplied as A' * B * A.

I’ll edit my comment with the correct timings


Sorry to not catch this all at once.

Shouldn’t this be += and initialized to D = zeros(V, size(A,2))? Or maybe use a temporary variable Di = zero(V) to do the accumulation before saving to D[i], if the compiler won’t hoist that on its own…

As written, the loop only does “work” on the final pair of j,k. The compiler may not be smart enough to have caught on, so it may not have a performance impact, but I’m suspicious that this function manages to be so much faster than even B*A despite that it should be doing “more” calculations. The bulk of that is likely the small dimension k=10 giving a meaningful opportunity to beat BLAS.
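To see the difference concretely, here is a plain (non-@tturbo) toy version of the loop; the numbers are mine, chosen so the correct answer is easy to check by hand:

```julia
A = [1.0 2.0; 3.0 4.0]   # toy 2×2 case, just for illustration
B = [1.0 0.0; 0.0 1.0]   # identity, so diag(A' * B * A) = squared column norms
n = size(A, 2)
D_wrong = zeros(n)
D_right = zeros(n)
for i in 1:n, j in 1:2, k in 1:2
    D_wrong[i] = conj(A[j, i]) * B[j, k] * A[k, i]   # overwrites: keeps only the last (j, k) term
    D_right[i] += conj(A[j, i]) * B[j, k] * A[k, i]  # accumulates the full double sum
end
# D_right == [10.0, 20.0], the true diagonal; D_wrong == [9.0, 16.0]
```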

EDIT: I think there may also be an indexing bug in the expression above. It looks like something more like

D[i] += A[j,i]' * B[j,k] * A[k,i]

is correct but I derived this using different variables and might have messed up the translation.

Whoops, yes. Corrected.

Thanks a lot for the detailed response.

f4 is not working for me, some issue with LoopVectorization. What version are you on?

Here is what I tried:

using LinearAlgebra, LoopVectorization

function diagprod(A::Matrix{T}, B::Matrix{U}) where {T, U} # same as Mason's f4
    V = promote_type(T, U)
    @assert size(A, 1) == size(B, 1) == size(B, 2)
    D = Vector{V}(undef, size(A, 2))
    @tturbo for i ∈ eachindex(D)
        for j ∈ axes(B, 1)
            for k ∈ axes(B, 2)
                D[i] = conj(A[j, i]) * B[j, k] * A[j, i]
            end
        end
    end
end

A = randn(5, 100)
B = randn(5, 5)
diag(A' * B * A) ≈ diagprod(A, B)

This leads to a complicated error. Briefly:

ERROR: MethodError: no method matching __vstore!(::Ptr{Float64}, ::VectorizationBase.VecUnroll{3, 2, Float64, VectorizationBase.Vec{2, Float64}}, ::Static.StaticInt{0}, ::Static.False, ::Static.False, ::Static.False, ::Static.StaticInt{16})

See stacktrace.txt · GitHub for details and stack trace. Any ideas?

Crap, yeah you’re right that I should have done += on zeros. The good news is that it does not result in a big performance loss. Updating my post yet again.


Hm, I can’t reproduce the problem you encountered. I’ve got

LoopVectorization v0.12.158

Can you try and see if

using LinearAlgebra, LoopVectorization

function diagprod(A::Matrix{T}, B::Matrix{U}) where {T, U}
    V = promote_type(T, U)
    @assert size(A, 1) == size(B, 1) == size(B, 2)
    D = zeros(size(A, 2))
    @tturbo for i ∈ eachindex(D)
        for j ∈ axes(B, 1)
            for k ∈ axes(B, 2)
                D[i] += conj(A[j, i]) * B[j, k] * A[k, i]
            end
        end
    end
    D
end

A = randn(5, 100)
B = randn(5, 5)
diag(A' * B * A) ≈ diagprod(A, B)

works for you?

Yes, that works. I see what's different: you're not using promote_type anymore. Do you understand the problem?

Yep, I am using the same version of LoopVectorization.
I am on Julia v1.8.5, macOS.

I’m still using promote_type. The difference was just that I was traversing the correct loops. Before we accidentally did B[j, k] * A[j, i] instead of B[j, k] * A[k, i], but I don’t know why that would result in the error you saw, I didn’t get an error just a wrong answer :sweat_smile:

You’re calling V = promote_type(T, U) but then V is not used anywhere. However I confirm this is not the issue. Yes I see the index difference, that must be it.


Oh, I see, yeah it should have been D = zeros(V, size(A, 2)), but that won’t affect this test.

I’ll update Fast `diag(A' * B * A)` - #6 by Mason


Thanks!

Just for the record,

f6(A, B) = map(LinearAlgebra.dot, eachcol(A), Iterators.repeated(B), eachcol(A))

also exists, and uses the dot(v, B, v) notation, which computes exactly the needed form. It is slower than the suggested f4 but uses the same amount of memory.
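A self-contained check that this agrees with the dense computation (sizes arbitrary):

```julia
using LinearAlgebra

f6(A, B) = map(LinearAlgebra.dot, eachcol(A), Iterators.repeated(B), eachcol(A))

A = randn(10, 1000)
B = randn(10, 10)
@assert f6(A, B) ≈ diag(A' * B * A)   # dot(v, B, v) computes v' * B * v per column
```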

The benchmark here is for specific n and k and the results may depend on those, so @e3c6 , if there are specific values you need, it would be helpful to know (and if any of the matrices are sparse would also be interesting).


Looks like some part of codegen is failing to reduce along one of the dimensions.
I’d be happy to walk someone through fixing this if they’re interested; otherwise I may take a look later, at least to see if fixing this would be easy (which it might be).

Also, FWIW, it may be faster to do

    D = Vector{promote_type(T,U)}(undef, size(A, 2))
    @tturbo for i ∈ eachindex(D)
        di = zero(eltype(D))
        for j ∈ axes(B, 1)
            for k ∈ axes(B, 2)
                di += conj(A[j, i]) * B[j, k] * A[k, i]
            end
        end
        D[i] = di
    end