Performance of computing the diagonal of V' * A * V

I want to compute the diagonal elements of

X = V^\top A V
  • where V is an orthogonal matrix
  • A is a symmetric matrix

I noticed that it is much faster to just compute X and then extract the diagonal elements than computing only the diagonal elements directly.

using LinearAlgebra, SparseArrays, Random, BenchmarkTools

rng = Random.MersenneTwister(23163213)

# my current approach
function diagonal_elements!(diag_elem::Vector{T}, A::AbstractMatrix{T}, V::AbstractMatrix{T}, temp::Vector{T}) where {T <: AbstractFloat}
    n = size(A, 2)
    @inbounds for i = 1:n
        vi = view(V, :, i)
        mul!(temp, A, vi)
        diag_elem[i] = dot(vi, temp)
    end
    return nothing
end

 # naive (but faster?) way
function diagonal_elements2!(diag_elem::Vector{T}, A::AbstractMatrix{T}, V::AbstractMatrix{T}) where {T <: AbstractFloat}
    Λ = V' * A * V
    @inbounds @simd for i in eachindex(diag_elem)
        diag_elem[i] = Λ[i, i]
    end
    return nothing
end

# make symmetric matrix
n = 2000
A = rand(rng, n, n)
A = 0.5 * (A + A')

# A = Symmetric(A) # seems to slow things down

# make an orthogonal matrix 
V = eigen(Symmetric(rand(rng, n, n))).vectors

# preallocate
diag_elem = zeros(n)
diag_elem2 = zeros(n)

# compute diagonal of V' * A * V where A is symmetric and V orthogonal

I get the following benchmark results:

julia> @benchmark diagonal_elements!($diag_elem, $A, $V, $zeros(n))
  memory estimate:  70.27 KiB
  allocs estimate:  3490
  minimum time:     1.328 s (0.00% GC)
  median time:      1.346 s (0.00% GC)
  mean time:        1.345 s (0.00% GC)
  maximum time:     1.360 s (0.00% GC)
  samples:          4
  evals/sample:     1
julia> @benchmark diagonal_elements2!($diag_elem2, $A, $V)
  memory estimate:  61.04 MiB
  allocs estimate:  4
  minimum time:     221.781 ms (0.00% GC)
  median time:      240.748 ms (2.60% GC)
  mean time:        239.897 ms (2.42% GC)
  maximum time:     257.615 ms (5.49% GC)
  samples:          21
  evals/sample:     1

Sure, the naive way is using optimized BLAS calls, but I would have expected diagonal_elements! to still be faster, as we are only computing a fraction of the elements of the matrix product.
Any ideas / suggestions on how to speed up the first function?

The complexity of both methods is O(n^3), and in a battle of constant factors the BLAS is always going to beat you unless you really know what you are doing. (You are only saving a factor of 2 in the number of flops with your hand-rolled method, and the constant-factor advantage of the BLAS is much larger than this.)

Instead of multiplying A by the columns of V one by one, why not use BLAS for the matrix–matrix product Y = AV, which does exactly the same O(n^3) work but much faster? Then “manually” compute the dot products of the columns of V with the columns of Y — this is only O(n^2), so it will asymptotically beat the BLAS matrix–matrix product.

(Note also that BLAS is multi-threaded by default. I would typically call LinearAlgebra.BLAS.set_num_threads(1) to turn off multi-threading for initial benchmarking. Once you get good serial performance, then try to optimize multi-threading.)

As @stevengj said, maybe there is a better method, but one thing you might try with the current approach is

function diagonal_elements(n::Int, A::AbstractMatrix{T}, V::AbstractMatrix{T}, temp::Vector{T}) where {T <: AbstractFloat}
    diag_elem = Vector{AbstractFloat}(undef, n)
    @inbounds for i = 1:n
        vi = view(V, :, i)
        mul!(temp, A, vi)
        diag_elem[i] .= dot(vi, temp)
    end
    return diag_elem
end

Here, even though the array allocation is part of the function, I still get a 25% speed-up on my computer by doing the diagonal element assignment in place with diag_elem[i] .= dot(vi, temp).

(edit: note the name change since this is no longer a mutating function)

This makes no sense. diag_elem[i] is a scalar, so .= is no more “in-place” than =.
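For reference, a minimal sketch of the scalar-index distinction (a standalone example, not part of the thread's benchmarks):

```julia
x = zeros(3)

# Plain assignment stores into the array slot:
x[1] = 2.0

# With an all-scalar index, `x[1] .= 2.0` broadcasts into the *value*
# x[1] (a Float64) rather than into the array, which errors on recent
# Julia versions:
#   x[1] .= 2.0   # MethodError

# A non-scalar index produces a view, so this form really is in-place:
x[1:1] .= 3.0
```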

Huh, I’m not sure why, then, but that version takes the code on my computer from 7.4 s down to 5.9 s.

In particular, this is about 2x faster than diagonal_elements2! as expected (because it saves one of the matrix–matrix multiplications) for your 2000 × 2000 test case on my machine:

@views function diagonal_elements3!(diag_elem, A, V, Y)
    mul!(Y, A, V)
    for i in eachindex(diag_elem)
        diag_elem[i] = dot(V[:, i], Y[:, i])
    end
    return diag_elem
end
(No performance reason to declare the argument types, though for clarity you might declare them as diag_elem::AbstractVector, A::AbstractMatrix, V::AbstractMatrix, Y::AbstractMatrix.)
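As a quick sanity check (a sketch on a small random case, not the 2000 × 2000 benchmark), the result should match extracting the diagonal of the full product up to floating-point roundoff:

```julia
using LinearAlgebra

@views function diagonal_elements3!(diag_elem, A, V, Y)
    mul!(Y, A, V)                          # one BLAS matrix–matrix product
    for i in eachindex(diag_elem)
        diag_elem[i] = dot(V[:, i], Y[:, i])  # O(n^2) column dot products
    end
    return diag_elem
end

n = 50
A = rand(n, n); A = 0.5 * (A + A')          # symmetric test matrix
V = eigen(Symmetric(rand(n, n))).vectors    # orthogonal test matrix

d = diagonal_elements3!(zeros(n), A, V, zeros(n, n))
reference = diag(V' * A * V)
```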

Note that Vector{AbstractFloat} is also a bad idea in general, because it is an abstractly typed container.

Ah, interesting. I did that mainly to mirror the types from the original function, but I didn’t realize that was in the performance tips.

I think you want $(zeros(n)), by the way, so that the whole call to zeros happens before benchmarking — the precedence of $ is higher than that of the function call.
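To illustrate the difference (a sketch assuming BenchmarkTools is available; `f` is a stand-in function, not from the thread):

```julia
using BenchmarkTools

f(x) = sum(x)
n = 1000

# `$zeros(n)` interpolates only `zeros` itself; the call `zeros(n)` still
# runs inside every benchmark evaluation, so its allocation is measured.
b1 = @benchmark f($zeros(n))

# `$(zeros(n))` evaluates the whole call once, before benchmarking,
# and interpolates the resulting array.
b2 = @benchmark f($(zeros(n)))
```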

Your posted function doesn’t even run for me (it throws an error) in Julia 1.6, so I can’t replicate this. Maybe you were benchmarking something else by accident?

Yeah, I don’t know. I closed down my old Julia session, and now it’s not working for me either — and it’s the very line I thought gave the speed-up that is causing the issue. I must have had some variables declared that were conflicting with/overwriting what I thought I was benchmarking. :man_shrugging: