# Fast diag(A' * B * A)

Let A be a k \times n matrix and B a k \times k matrix, with k \ll n.

I want to compute the diagonal

diag(A' * B * A)


in a faster way. Here A' * B * A is a large n \times n matrix, which I would like to avoid instantiating.
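Note that each diagonal entry depends only on one column of A:

diag(A' * B * A)_i = a_i' * B * a_i = \sum_{j,k} \overline{A_{j,i}} B_{j,k} A_{k,i},

where a_i is the i-th column of A, so each entry costs only O(k^2) work and the full n \times n product is never needed.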

Is there a faster (BLAS?) way to do this operation?

Here is a possibility:

"""
diagprod(A, B)

Compute diag(A' * B * A) in an efficient way.
"""
function diagprod(A::AbstractMatrix, B::AbstractMatrix)
@assert size(A, 1) == size(B, 1) == size(B, 2)
return vec(sum(A .* (B * A); dims=1))
end
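For example, this can be sanity-checked against the naive version (redefining the one-liner inline so the snippet stands alone; the name `diagprod_check` is just for illustration):

```julia
using LinearAlgebra

# same one-liner as diagprod above
diagprod_check(A, B) = vec(sum(A .* (B * A); dims=1))

A = randn(10, 1000)  # k = 10, n = 1000
B = randn(10, 10)

# agrees with the naive construction that builds the full n × n matrix
@assert diagprod_check(A, B) ≈ diag(A' * B * A)
```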


Can we make it faster?


Wouldn't a loop be faster? Or maybe Tullio can provide a happy medium?

I think a loop with LoopVectorization would blow this out of the water by removing the need for allocations


Your diagprod is algorithmically about as fast as one could expect and uses BLAS for the multiplication. I wouldn't expect you to do much better. The largest remaining opportunities likely lie in reducing allocations.

You could pass in workspace variables to avoid the allocation for B * A, but that's a bit of a nuisance and likely won't save a ton. But you can do the A .* BA calculation in-place to re-use an existing allocation:

BA = B * A
BA .= conj.(A) .* BA
return vec(sum(BA; dims=1))


(note that I added conj to the left A since it's actually A').
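Putting those two tweaks together, a sketch of the reduced-allocation version might look like this (the name `diagprod2` is mine, not from the original post):

```julia
using LinearAlgebra

# sketch: reuse the B * A buffer for the elementwise product
function diagprod2(A::AbstractMatrix, B::AbstractMatrix)
    @assert size(A, 1) == size(B, 1) == size(B, 2)
    BA = B * A              # one allocation (BLAS gemm)
    BA .= conj.(A) .* BA    # in-place; conj because the left factor is A'
    return vec(sum(BA; dims=1))
end

A = randn(5, 50)
B = randn(5, 5)
@assert diagprod2(A, B) ≈ diag(A' * B * A)
```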

Further, you can avoid a memory pass through the data by simply using

map(LinearAlgebra.dot, eachcol(A), eachcol(B * A))


Something with LoopVectorization or Tullio could be a little faster, I expect. However, if you don't use BLAS for the B * A calculation you'll likely lose performance at large matrix sizes, and the remaining sum(conj.(A) .* BA; dims=1) piece only has so much room for improvement.

If you're dealing with small matrices, there's much more room for non-BLAS solutions to shine.

EDIT: At least for some input sizes, posters below have demonstrated speedups notably bigger than what I might have anticipated, by using LoopVectorization and related packages.


I assume you meant that you wanted A to be k \times n? Here are a few options:

#+begin_src julia
using Tullio, LoopVectorization, BenchmarkTools, LinearAlgebra

f1(A, B) = diag(A' * B * A)
f2(A, B) = vec(sum(A .* (B * A); dims=1))
f3(A, B) = @tullio D[i] := A[j, i] * B[j, k] * A[k, i]

function f4(A::Matrix{T}, B::Matrix{U}) where {T, U}
    V = promote_type(T, U)
    @assert size(A, 1) == size(B, 1) == size(B, 2)
    D = zeros(V, size(A, 2))
    @tturbo for i ∈ eachindex(D)
        for j ∈ axes(B, 1)
            for k ∈ axes(B, 2)
                D[i] += conj(A[j, i]) * B[j, k] * A[k, i]
            end
        end
    end
    D
end
f5(A, B) = map(LinearAlgebra.dot, eachcol(A), eachcol(B * A))

function f6(A::Matrix{T}, B::Matrix{U}) where {T, U}
    V = promote_type(T, U)
    @assert size(A, 1) == size(B, 1) == size(B, 2)
    D = Vector{V}(undef, size(A, 2))
    @tturbo for i ∈ eachindex(D)
        di = zero(eltype(D))
        for j ∈ axes(B, 1)
            for k ∈ axes(B, 2)
                di += conj(A[j, i]) * B[j, k] * A[k, i]
            end
        end
        D[i] = di
    end
    D
end

let n = 1000, k = 10
    A = randn(k, n)
    B = randn(k, k)
    for f ∈ (f1, f2, f3, f4, f5, f6)
        print(f, "   ")
        @btime $f($A, $B)
    end
end;

#+end_src

#+RESULTS:
: f1     599.030 μs (5 allocations: 7.71 MiB)
: f2     21.550 μs (7 allocations: 164.36 KiB)
: f3     20.880 μs (1 allocation: 7.94 KiB)
: f4     4.636 μs (1 allocation: 7.94 KiB)
: f5     23.940 μs (3 allocations: 86.11 KiB)
: f6     4.481 μs (1 allocation: 7.94 KiB)


Looks like writing the manual loop in LoopVectorization.jl is the winner here.

Edit1: I missed one of the suggestions above, so I added it as f5.
Edit2: I messed up the order of indices for f4, so it was faster than it should have been. However, even with this fix it's still the fastest option, just by a lesser margin.
Edit3: Fixed the problem pointed out here: Fast diag(A' * B * A) - #9 by mikmoore. This again has a negative performance impact on f4, but it's still the fastest.
Edit4: I found yet another problem with f3 and f4 where I accidentally wrote B[j, k] * A[j, i] instead of B[j, k] * A[k, i]. Fixing this slows down f3 and f4 a bit, but f3 is hit harder than f4, and f4 remains the fastest.
Edit5: Added another LoopVectorization.jl example that reduces the number of array accesses required, so it speeds things up a little.


Shouldn't this be size(A, 2)? This is doing 100x less work than it should be.

Oops, yes, it should be size(A, 2). It's doing 100x less work than it should because the original poster had the wrong order of indices for A (I think). A should be k \times n, not n \times k, if it's being multiplied as A' * B * A.

I'll edit my comment with the correct timings.


Sorry to not catch this all at once.

Shouldn't this be += and initialized to D = zeros(V, size(A, 2))? Or maybe use a temporary variable Di = zero(V) to do the accumulation before saving to D[i], if the compiler won't hoist that on its own…

As written, the loop only does "work" on the final pair of j, k. The compiler may not be smart enough to have caught on, so it may not have a performance impact, but I'm suspicious that this function manages to be so much faster than even B * A despite that it should be doing "more" calculations. The bulk of that is likely the small dimension k = 10 giving a meaningful opportunity to beat BLAS.

EDIT: I think there may also be an indexing bug in the expression above. It looks like something more like

D[i] += A[j,i]' * B[j,k] * A[k,i]


is correct but I derived this using different variables and might have messed up the translation.
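A plain-loop sanity check of that corrected expression (without @tturbo, just for illustration; `diagprod_ref` is a made-up name):

```julia
using LinearAlgebra

function diagprod_ref(A, B)
    D = zeros(promote_type(eltype(A), eltype(B)), size(A, 2))
    # accumulate D[i] = sum over j, k of conj(A[j,i]) * B[j,k] * A[k,i]
    for i in axes(A, 2), j in axes(B, 1), k in axes(B, 2)
        D[i] += conj(A[j, i]) * B[j, k] * A[k, i]
    end
    return D
end

A = randn(4, 20)
B = randn(4, 4)
@assert diagprod_ref(A, B) ≈ diag(A' * B * A)
```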

Whoops, yes. Corrected.

Thanks a lot for the detailed response.

f4 is not working for me, some issue with LoopVectorization. What version are you on?

Here is what I tried:

function diagprod(A::Matrix{T}, B::Matrix{U}) where {T, U} # same as Mason's f4
    V = promote_type(T, U)
    @assert size(A, 1) == size(B, 1) == size(B, 2)
    D = Vector{V}(undef, size(A, 2))
    @tturbo for i ∈ eachindex(D)
        for j ∈ axes(B, 1)
            for k ∈ axes(B, 2)
                D[i] = conj(A[j, i]) * B[j, k] * A[j, i]
            end
        end
    end
end

A = randn(5, 100)
B = randn(5, 5)
diag(A' * B * A) â diagprod(A, B)


This leads to a complicated error. Briefly:

ERROR: MethodError: no method matching __vstore!(::Ptr{Float64}, ::VectorizationBase.VecUnroll{3, 2, Float64, VectorizationBase.Vec{2, Float64}}, ::Static.StaticInt{0}, ::Static.False, ::Static.False, ::Static.False, ::Static.StaticInt{16})


See stacktrace.txt · GitHub for details and stack trace. Any ideas?

Crap, yeah you're right that I should have done += on zeros. The good news is that it does not result in a big performance loss. Updating my post yet again.


Hm, I can't reproduce the problem you encountered. I've got

LoopVectorization v0.12.158


Can you try and see if

function diagprod(A::Matrix{T}, B::Matrix{U}) where {T, U}
    V = promote_type(T, U)
    @assert size(A, 1) == size(B, 1) == size(B, 2)
    D = zeros(size(A, 2))
    @tturbo for i ∈ eachindex(D)
        for j ∈ axes(B, 1)
            for k ∈ axes(B, 2)
                D[i] += conj(A[j, i]) * B[j, k] * A[k, i]
            end
        end
    end
    D
end

A = randn(5, 100)
B = randn(5, 5)
diag(A' * B * A) â diagprod(A, B)


works for you?

Yes, that works. I see what's different is you're not using the promote_type anymore. Do you understand the problem?

Yep, I am using the same version of LoopVectorization.
I am on Julia v1.8.5, macOS.

I'm still using promote_type. The difference was just that I was traversing the correct loops. Before, we accidentally did B[j, k] * A[j, i] instead of B[j, k] * A[k, i], but I don't know why that would result in the error you saw; I didn't get an error, just a wrong answer.

You're calling V = promote_type(T, U) but then V is not used anywhere. However, I confirm this is not the issue. Yes, I see the index difference; that must be it.


Oh, I see, yeah it should have been D = zeros(V, size(A, 2)), but that won't affect this test.

I'll update Fast diag(A' * B * A) - #6 by Mason


Thanks!

Just for the record,

f6(A, B) = map(LinearAlgebra.dot, eachcol(A), Iterators.repeated(B), eachcol(A))


also exists, and uses the dot(v, B, v) notation, which specifically computes this needed form. It is slower than the suggested f4 but uses the same amount of memory.
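For reference, the three-argument dot from LinearAlgebra evaluates the quadratic form directly:

```julia
using LinearAlgebra

v = randn(5)
B = randn(5, 5)

# dot(x, B, y) computes x' * B * y without materializing B * y
@assert dot(v, B, v) ≈ v' * B * v
```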

The benchmark here is for specific n and k and the results may depend on those, so @e3c6, if there are specific values you need, it would be helpful to know (and whether any of the matrices are sparse would also be interesting).


Looks like some part of codegen is failing to reduce along one of the dimensions.
I'd be happy to walk someone through fixing this if they're interested; otherwise I may take a look later, at least to see if fixing this would be easy (which it might be).

Also, FWIW, it may be faster to do

D = Vector{promote_type(T, U)}(undef, size(A, 2))
@tturbo for i ∈ eachindex(D)
    di = zero(eltype(D))
    for j ∈ axes(B, 1)
        for k ∈ axes(B, 2)
            di += conj(A[j, i]) * B[j, k] * A[k, i]
        end
    end
    D[i] = di
end
