Matrix-vector product faster than matrix addition?

goerz · September 28, 2022, 9:01pm

I’m running some benchmarks, and to my surprise I’m finding that adding two matrices is much slower than a matrix-vector multiplication. The relevant benchmark code is this (see link for context):

using BenchmarkTools   # provides @benchmark
using LinearAlgebra    # provides mul! and axpy!

function benchmark_mv_vs_mpm()

    N = 1000
    H = random_hermitian_matrix(N)
    H0 = random_hermitian_matrix(N)
    Ψ = random_state_vector(N)
    ϕ = random_state_vector(N)
    val = 1.15

    println("*** matrix-vector product ϕ = H Ψ")
    b_mv1 = @benchmark mul!($ϕ, $H, $Ψ)
    display(b_mv1)

    println("*** matrix-vector product ϕ += v H Ψ")
    b_mv2 = @benchmark mul!($ϕ, $H, $Ψ, $val, true)
    display(b_mv2)

    println("*** matrix-vector product ϕ += H Ψ")
    b_mv3 = @benchmark mul!($ϕ, $H, $Ψ, true, true)
    display(b_mv3)

    println("*** matrix-matrix addition H += v H0")
    b_mpm1 = @benchmark axpy!($val, $H0, $H)
    display(b_mpm1)

    println("*** matrix-matrix addition H += H0")
    b_mpm2 = @benchmark axpy!(true, $H0, $H)
    display(b_mpm2)

    println("*** matrix-copy H = H0")
    b_mpm3 = @benchmark copyto!($H, $H0)
    display(b_mpm3)

end

with the initialization routines in a separate file, testutils.jl (not that it matters: everything is just standard dense matrices/vectors).

The resulting runtimes are this:

I guess I’ll take the super-fast matrix-vector multiplication, but this just seems a little odd to me. Fundamentally, both operations should scale as N^2. I don’t think I would have expected that even just copying a matrix would be so much slower than doing a matrix-vector product.

Does anybody have any insights into this behavior?

Oscar_Smith · September 28, 2022, 9:11pm

You should expect matrix addition to be roughly 3x slower, because you will be bottlenecked by memory bandwidth, and adding 2 matrices into a 3rd requires touching 3x as much memory as a matrix-vector product.
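As a rough illustration of that estimate, here is a back-of-the-envelope sketch of the memory traffic for N = 1000. It assumes complex double-precision elements (ComplexF64) and that every element is moved to or from main memory exactly once per read or write, ignoring caches; the helper names bytes_matvec and bytes_axpy are made up for this example.

```julia
# Rough memory-traffic model. Assumption: each read or write of an element
# costs one trip to main memory; cache effects are ignored.
bytes_matvec(N, T) = sizeof(T) * (N^2 + 2N)  # read H and Ψ, write ϕ
bytes_axpy(N, T)   = sizeof(T) * (3 * N^2)   # read H0 and H, write H

N = 1000
T = ComplexF64  # assumed element type

ratio = bytes_axpy(N, T) / bytes_matvec(N, T)

println("matvec traffic: ", bytes_matvec(N, T) / 1e6, " MB")
println("axpy traffic:   ", bytes_axpy(N, T) / 1e6, " MB")
println("ratio:          ", round(ratio, digits=2))
```

Under these assumptions the in-place addition moves close to three times as many bytes as the matrix-vector product, which is where the "roughly 3x" comes from. Real timings will deviate depending on cache sizes and BLAS threading.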

goerz · September 28, 2022, 9:13pm

Interesting! So this is true in general? That is in other languages, say, Fortran? (I could test this out, but writing benchmark code for Fortran would probably take me a day ;-))

Oscar_Smith · September 28, 2022, 9:22pm

Yeah. Fundamentally, BLAS1 and BLAS2 for largish matrices are linear in the amount of data, so their runtime will be directly correlated with memory visited.

lmiq · September 28, 2022, 10:58pm

And these are likely Fortran routines: ?axpy

goerz · September 28, 2022, 11:04pm

Oh, yeah, definitely! I’m using those in my Fortran code. I’m also using the matrix-vector products instead of matrix-matrix sums in Fortran, but in my Julia prototype, I decided to sum all my operators first, because I didn’t think it would make that much of a difference and it was easier to get started.
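For reference, the two alternatives can be sketched roughly as follows. This is only a minimal illustration, not the actual prototype: the matrices H0 and H1, the coefficient val, and the small size N = 200 are placeholders, and plain random complex matrices stand in for the Hermitian operators.

```julia
using LinearAlgebra

N = 200                      # placeholder size, kept small for a quick check
H0 = rand(ComplexF64, N, N)  # stand-ins for the operators
H1 = rand(ComplexF64, N, N)
Ψ = rand(ComplexF64, N)
val = 1.15

# Variant A: sum the operators first, then do one matrix-vector product.
H = copy(H0)
axpy!(val, H1, H)            # H += val * H1
ϕ_a = H * Ψ

# Variant B: never form the sum; accumulate matrix-vector products instead.
ϕ_b = zeros(ComplexF64, N)
mul!(ϕ_b, H0, Ψ)             # ϕ_b = H0 * Ψ
mul!(ϕ_b, H1, Ψ, val, true)  # ϕ_b += val * H1 * Ψ

println("difference between variants: ", norm(ϕ_a - ϕ_b))
```

Both variants give the same vector up to rounding. Variant B reads each matrix once and only ever writes a vector, whereas variant A additionally writes (and re-reads) a full N×N matrix.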

Nobody has written BenchmarkTools for Fortran, which makes benchmarking a whole lot more time-consuming, and I never thoroughly investigated the two alternatives.

In any case, this is good news, because my Julia code is going to become quite significantly faster!
