In particular, this is about 2x faster than diagonal_elements2! as expected (because it saves one of the matrix–matrix multiplications) for your 2000 \times 2000 test case on my machine:
@views function diagonal_elements3!(diag_elem, A, V, Y)
mul!(Y, A, V)
for i in eachindex(diag_elem)
diag_elem[i] = dot(V[:,i], Y[:,i])
end
return diag_elem
end
(No performance reason to declare the argument types, though for clarity you might declare them as diag_elem::AbstractVector, A::AbstractMatrix, V::AbstractMatrix, Y::AbstractMatrix.)