# Performance of computing the diagonal of V' * A * V

I want to compute the diagonal elements of

X = V^\top A V
• where V is an orthogonal matrix
• A is a symmetric matrix

I noticed that it is much faster to compute X in full and then extract its diagonal than to compute only the diagonal elements directly.

```julia
using LinearAlgebra, SparseArrays, Random, BenchmarkTools

rng = Random.MersenneTwister(23163213)

# my current approach
function diagonal_elements!(diag_elem::Vector{T}, A::AbstractMatrix{T}, V::AbstractMatrix{T}, temp::Vector{T}) where {T <: AbstractFloat}
    n = size(A, 2)
    @inbounds for i = 1:n
        vi = view(V, :, i)
        mul!(temp, A, vi)
        diag_elem[i] = dot(vi, temp)
    end
    return nothing
end

# naive (but faster?) way
function diagonal_elements2!(diag_elem::Vector{T}, A::AbstractMatrix{T}, V::AbstractMatrix{T}) where {T <: AbstractFloat}
    Λ = V' * A * V
    @inbounds @simd for i in eachindex(diag_elem)
        diag_elem[i] = Λ[i, i]
    end
    return nothing
end

# make a symmetric matrix
n = 2000
A = rand(rng, n, n)
A = 0.5 * (A + A')

# A = Symmetric(A) # seems to slow things down

# make an orthogonal matrix
V = eigen(Symmetric(rand(rng, n, n))).vectors

# preallocate
diag_elem = zeros(n)
diag_elem2 = zeros(n)

# compute diagonal of V' * A * V where A is symmetric and V orthogonal
```
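As a small sanity check (my addition, not from the original post), one can verify on a tiny instance that the column-by-column quadratic forms agree with the diagonal of the full product:

```julia
using LinearAlgebra

# Sanity check (illustrative sketch): the quadratic forms v_i' * A * v_i
# must equal the diagonal entries of V' * A * V.
n = 5
A = rand(n, n); A = 0.5 * (A + A')           # small symmetric test matrix
V = eigen(Symmetric(rand(n, n))).vectors      # small orthogonal test matrix

d1 = [dot(view(V, :, i), A * view(V, :, i)) for i in 1:n]
d2 = diag(V' * A * V)

@assert d1 ≈ d2
```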


I get the following benchmark results:

```
julia> @benchmark diagonal_elements!($diag_elem, $A, $V, $zeros(n))
BenchmarkTools.Trial:
  memory estimate:  70.27 KiB
  allocs estimate:  3490
  --------------
  minimum time:     1.328 s (0.00% GC)
  median time:      1.346 s (0.00% GC)
  mean time:        1.345 s (0.00% GC)
  maximum time:     1.360 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1
```

```
julia> @benchmark diagonal_elements2!($diag_elem2, $A, $V)
BenchmarkTools.Trial:
  memory estimate:  61.04 MiB
  allocs estimate:  4
  --------------
  minimum time:     221.781 ms (0.00% GC)
  median time:      240.748 ms (2.60% GC)
  mean time:        239.897 ms (2.42% GC)
  maximum time:     257.615 ms (5.49% GC)
  --------------
  samples:          21
  evals/sample:     1
```

Sure, the naive way is using optimized BLAS calls, but I would have expected `diagonal_elements!` to still be faster, as we are only computing a fraction of the elements of the matrix product. Any ideas / suggestions on how to speed up the first function?

The complexity of both methods is O(n^3), and in a battle of constant factors BLAS is always going to beat you unless you really know what you are doing. (You are only saving a factor of 2 in the number of flops with your hand-rolled method, and the constant-factor advantage of BLAS is much larger than this.)

Instead of multiplying A by the columns of V one by one, why not use BLAS for the matrix–matrix product Y = AV, which does exactly the same O(n^3) work but much faster? Then "manually" compute the dot products of the columns of V with the columns of Y; this step is only O(n^2), so its cost is asymptotically negligible next to the BLAS matrix–matrix product.

(Note also that BLAS is multi-threaded by default. I would typically call `LinearAlgebra.BLAS.set_num_threads(1)` to turn off multi-threading for initial benchmarking. Once you get good serial performance, then try to optimize multi-threading.)
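As an aside (my sketch, not code from the thread), the O(n^2) reduction step suggested here can also be written without an explicit loop, as a broadcasted multiply followed by a column-sum:

```julia
using LinearAlgebra

# Sketch (illustrative, not from the thread): compute Y = A * V with one
# BLAS matrix-matrix product, then reduce each column of V .* Y, since
# diag(V' * A * V)[i] == dot(V[:, i], Y[:, i]).
n = 4
A = rand(n, n); A = 0.5 * (A + A')
V = eigen(Symmetric(rand(n, n))).vectors

Y = A * V                          # single O(n^3) BLAS product
d = vec(sum(V .* Y, dims = 1))     # O(n^2) columnwise dot products

@assert d ≈ diag(V' * A * V)
```

The broadcast allocates a temporary the size of V, so for large n the explicit dot-product loop is the more memory-friendly choice.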
As @stevengj said, maybe there is a better method, but one thing you might try with the current approach is

```julia
function diagonal_elements(n::Int, A::AbstractMatrix{T}, V::AbstractMatrix{T}, temp::Vector{T}) where {T <: AbstractFloat}
    diag_elem = Vector{AbstractFloat}(undef, n)
    @inbounds for i = 1:n
        vi = view(V, :, i)
        mul!(temp, A, vi)
        diag_elem[i] .= dot(vi, temp)
    end
    return diag_elem
end
```

Here, even though the array allocation is part of the function, I still get a 25% speed-up on my computer by doing the diagonal-element assignment in place with `diag_elem[i] .= dot(vi, temp)`. (edit: note the name change, since this is no longer a mutating function)

This makes no sense. `diag_elem[i]` is a scalar, so `.=` is no more "in-place" than `=`.

Huh, I'm not sure why then, but that way gets the code on my computer from 7.4 seconds to 5.9 s.

In particular, this is about 2x faster than `diagonal_elements2!`, as expected (because it saves one of the matrix–matrix multiplications), for your 2000 × 2000 test case on my machine:

```julia
@views function diagonal_elements3!(diag_elem, A, V, Y)
    mul!(Y, A, V)
    for i in eachindex(diag_elem)
        diag_elem[i] = dot(V[:, i], Y[:, i])
    end
    return diag_elem
end
```

(No performance reason to declare the argument types, though for clarity you might declare them as `diag_elem::AbstractVector, A::AbstractMatrix, V::AbstractMatrix, Y::AbstractMatrix`.) Note that `Vector{AbstractFloat}` is also a bad idea in general, because it is an abstractly typed container.

Ah, interesting. I did that mainly to mirror the types from the original function, but I didn't realize that was in the performance tips.

I think you want `$(zeros(n))` in your benchmark call, by the way, so that the whole call to `zeros` happens before benchmarking; the precedence of `$` is higher than that of the function call.

Your posted function doesn’t even run for me (it throws an error) in Julia 1.6, so I can’t replicate this. Maybe you were benchmarking something else by accident?

Yeah, I don’t know. I closed down my old Julia session, and now it’s not working for me either; it’s the very line I thought gave the speed-up that is causing the issue. I must have had some variables declared that were conflicting with / overwriting what I thought I was benchmarking.