I’m running some benchmarks, and to my surprise I’m finding that adding two matrices is *much* slower than a matrix-vector multiplication. The relevant benchmark code is this (see link for context):

```
using BenchmarkTools
using LinearAlgebra: mul!, axpy!

include("testutils.jl")  # defines random_hermitian_matrix, random_state_vector

function benchmark_mv_vs_mpm()
    N = 1000
    H = random_hermitian_matrix(N)
    H0 = random_hermitian_matrix(N)
    Ψ = random_state_vector(N)
    ϕ = random_state_vector(N)
    val = 1.15
    println("*** matrix-vector product ϕ = H Ψ")
    b_mv1 = @benchmark mul!($ϕ, $H, $Ψ)
    display(b_mv1)
    println("*** matrix-vector product ϕ += v H Ψ")
    b_mv2 = @benchmark mul!($ϕ, $H, $Ψ, $val, true)
    display(b_mv2)
    println("*** matrix-vector product ϕ += H Ψ")
    b_mv3 = @benchmark mul!($ϕ, $H, $Ψ, true, true)
    display(b_mv3)
    println("*** matrix-matrix addition H += v H0")
    b_mpm1 = @benchmark axpy!($val, $H0, $H)
    display(b_mpm1)
    println("*** matrix-matrix addition H += H0")
    b_mpm2 = @benchmark axpy!(true, $H0, $H)
    display(b_mpm2)
    println("*** matrix-copy H = H0")
    b_mpm3 = @benchmark copyto!($H, $H0)
    display(b_mpm3)
end
```

with the initialization routines in a separate file, `testutils.jl` (not that it matters: everything is just standard dense matrices/vectors).
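For reference, here is a minimal sketch of what those helpers might look like — the exact definitions live in `testutils.jl`, so these are plausible stand-ins that only assume standard dense `ComplexF64` arrays:

```julia
using LinearAlgebra

# Stand-in for the helper in testutils.jl: a dense N×N Hermitian matrix
# with random complex entries.
function random_hermitian_matrix(N)
    X = randn(ComplexF64, N, N)
    return Matrix(Hermitian((X + X') / 2))
end

# Stand-in for the helper in testutils.jl: a random, normalized state vector.
function random_state_vector(N)
    Ψ = randn(ComplexF64, N)
    return Ψ ./ norm(Ψ)
end
```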

The resulting runtimes are as follows:

I guess I’ll take the super-fast matrix-vector multiplication, but this still seems a little odd to me. Fundamentally, both operations should scale as N^2. I certainly wouldn’t have expected even a plain matrix copy to be so much slower than a matrix-vector product.
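One back-of-the-envelope count I did (assuming `ComplexF64` entries, and counting only main-memory traffic — an assumption about where the bottleneck is, not a measurement): the flop counts are both O(N^2), but `axpy!` has to stream roughly three N²-sized arrays through memory, while `mul!` streams only one. That still seems too small a factor to explain what I’m seeing, though:

```julia
N = 1000
b = sizeof(ComplexF64)          # 16 bytes per entry

# mul!(ϕ, H, Ψ): read H once, plus reading Ψ and writing ϕ (negligible)
mv_bytes = N^2 * b + 2 * N * b

# axpy!(v, H0, H): read H0, read H, write H back
axpy_bytes = 3 * N^2 * b

ratio = axpy_bytes / mv_bytes   # ≈ 3× the memory traffic
```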

Does anybody have any insights into this behavior?