Performance of computing the diagonal of V' * A *V

In particular, this is about 2x faster than diagonal_elements2! as expected (because it saves one of the matrix–matrix multiplications) for your 2000 \times 2000 test case on my machine:

@views function diagonal_elements3!(diag_elem, A, V, Y)
    mul!(Y, A, V)
    for i in eachindex(diag_elem)
        diag_elem[i] = dot(V[:,i], Y[:,i])
    end
    return diag_elem
end

(No performance reason to declare the argument types, though for clarity you might declare them as diag_elem::AbstractVector, A::AbstractMatrix, V::AbstractMatrix, Y::AbstractMatrix.)