Matrix-vector product faster than matrix addition?

I’m running some benchmarks, and to my surprize I’m finding adding two matrices much slower than a matrix-vector multiplication. The relevant benchmark code is this (see link for context):

function benchmark_mv_vs_mpm()

    N = 1000
    H = random_hermitian_matrix(N)
    H0 = random_hermitian_matrix(N)
    Ψ = random_state_vector(N)
    ϕ = random_state_vector(N)
    val = 1.15

    println("*** matrix-vector product ϕ = H Ψ")
    b_mv1 = @benchmark mul!($ϕ, $H, $Ψ)
    display(b_mv1)

    println("*** matrix-vector product ϕ += v H Ψ")
    b_mv2 = @benchmark mul!($ϕ, $H, $Ψ, $val, true)
    display(b_mv2)

    println("*** matrix-vector product ϕ += H Ψ")
    b_mv3 = @benchmark mul!($ϕ, $H, $Ψ, true, true)
    display(b_mv3)

    println("*** matrix-matrix addition H += v H0")
    b_mpm1 = @benchmark axpy!($val, $H0, $H)
    display(b_mpm1)

    println("*** matrix-matrix addition H += H0")
    b_mpm2 = @benchmark axpy!(true, $H0, $H)
    display(b_mpm2)

    println("*** matrix-copy H = H0")
    b_mpm3 = @benchmark copyto!($H, $H0)
    display(b_mpm3)

end

with the initialization routines in a separate file testutils.jl (not that it matters: everything is just standard dense matrices/vectors)

The resulting runtimes are this:

I guess I’ll take the super-fast matrix-vector multiplication, but this just seems a little odd to me. Fundamentally, both operations should scale as N^2. I don’t think I would have expected that even just copying a matrix would be so much slower than doing a matrix-vector product.

Does anybody have any insights into this behavior?

you should expect matrix addition to be roughly 3x slower because you will be bottlenecked by memory bandwidth and adding 2 matrices into a 3rd requires looking at 3x as much memory.

Interesting! So this is true in general? That is in other languages, say, Fortran? (I could test this out, but writing benchmark code for Fortran would probably take me a day ;-))

Yeah. Fundamentally, BLAS1 and BLAS2 for largish matrices are linear in the amount of data, so their runtime will be directly correlated with memory visited.

1 Like

And these are likely Fortran routines: ?axpy

Oh, yeah, definitely! I’m using those in my Fortran code. I’m also using the matrix-vector products instead of matrix-matrix sums in Fortran, but in my Julia prototype, I decided to sum all my operators first, because I didn’t think it would make that much of a difference and it was easier to get started.

Nobody has written BenchmarkTools for Fortran, which makes benchmarking a whole lot more time consuming, and I never thoroughly investigated the two alternatives :wink:

In any case, this is good news, because my Julia code is going to become quite significantly faster! :slight_smile: