Improve performance of vector-matrix multiplication

uwechsler · November 16, 2020, 9:46pm

In a larger project, I have to compute a (row) vector-matrix multiplication. In mathematical terms, I would write it as c = x \cdot M^T, where x \in \mathbb{R}^{1 \times n}, M^T \in \mathbb{R}^{n \times m}, and c \in \mathbb{R}^{1 \times m}. However, in the implementation, x and c are not row vectors or matrices but one-dimensional arrays.

My current implementation is:

LinearAlgebra.gemv!(c, 'T', Mᵀ, x)
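
For comparison, the same in-place product can presumably also be written with the exported `mul!` and a lazy `transpose` wrapper, which does not depend on `LinearAlgebra.gemv!` (a function that is not part of the exported API). This is a minimal sketch; the sizes `n`, `m` and the random data are illustrative assumptions, not the values from the benchmark:

```julia
using LinearAlgebra

n, m = 3, 5                      # illustrative sizes only
M  = rand(m, n)                  # M  is m × n
Mᵀ = permutedims(M)              # Mᵀ is n × m, materialized as its own array
x  = rand(n)
c  = zeros(m)

# transpose(Mᵀ) is a lazy wrapper, so no data is copied here.
# mul! writes the product (Mᵀ)ᵀ * x = M * x into c.
mul!(c, transpose(Mᵀ), x)

c ≈ M * x                        # true
```

Whether this performs differently from the call above would have to be measured on the machine in question.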

In my application, the vector-matrix multiplication is done several times.
A minimal example looks like this:

using ThreadsX
using LinearAlgebra
using BenchmarkTools

# Initialize Data
nRows = 50_000
nCols = 15
nVecs = 16
c_vec = [zeros(nRows) for i=1:nVecs]
M_vec = [rand(nRows, nCols) for i=1:nVecs]
Mᵀ_vec = [permutedims(M) for M in M_vec]
x_vec = [rand(nCols) for i=1:nVecs]

function vec_matT_mul(c_vec, x_vec, Mᵀ_vec)
    BLAS.set_num_threads(1)
    ThreadsX.foreach(eachindex(c_vec)) do i
        Mᵀ = Mᵀ_vec[i]
        x = x_vec[i]
        c = c_vec[i]
        # inplace version of c = x ⋅ Mᵀ
        LinearAlgebra.gemv!(c, 'T', Mᵀ, x)
    end
    BLAS.set_num_threads(96)
    return nothing
end
@btime vec_matT_mul($c_vec, $x_vec, $Mᵀ_vec)
#  3.297 ms (128 allocations: 15.27 KiB)

Doing the same thing as a matrix-vector multiplication c = M \cdot x with mul!(c, M, x) takes half as much time. (I cannot use this variant, because I need the matrix M in transposed form so that another part of the code can make proper use of @avx.)

function mat_vec_mul(c_vec, x_vec, M_vec)
    BLAS.set_num_threads(1)
    ThreadsX.foreach(eachindex(c_vec)) do i
        M = M_vec[i]
        x = x_vec[i]
        c = c_vec[i]
        # inplace version of c = M ⋅ x
        mul!(c, M, x)
    end
    BLAS.set_num_threads(96)
    return nothing
end
@btime mat_vec_mul($c_vec, $x_vec, $M_vec)
# 1.760 ms (127 allocations: 15.25 KiB)

Therefore, I was wondering if there is any potential for improvement in the computation of the vector-matrix multiplication or the code in general.
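
One thing that could be tried is a hand-written kernel that exploits the memory layout: since Mᵀ is stored column-major, each entry of c is a dot product of x with one contiguous column of Mᵀ. The following is only a sketch under that assumption; `colwise_dot!` is a hypothetical helper name, not something from the code above, and whether it beats the BLAS call on a given machine would need to be benchmarked:

```julia
using LinearAlgebra

# Hypothetical helper: computes c[j] = dot(Mᵀ[:, j], x) for every column j.
# Each column of Mᵀ is contiguous in memory, so the inner loop reads sequentially.
function colwise_dot!(c::Vector{T}, Mᵀ::Matrix{T}, x::Vector{T}) where {T<:Real}
    n, m = size(Mᵀ)
    length(x) == n || throw(DimensionMismatch("length(x) must equal size(Mᵀ, 1)"))
    length(c) == m || throw(DimensionMismatch("length(c) must equal size(Mᵀ, 2)"))
    @inbounds for j in 1:m
        acc = zero(T)
        # @simd allows the compiler to reorder the floating-point accumulation
        @simd for i in 1:n
            acc += Mᵀ[i, j] * x[i]
        end
        c[j] = acc
    end
    return c
end

# Small usage example with made-up sizes
Mᵀ = rand(15, 1_000)
x  = rand(15)
c  = zeros(1_000)
colwise_dot!(c, Mᵀ, x)
c ≈ transpose(Mᵀ) * x            # true
```

The outer loop over `j` is independent across columns, so it could in principle be split across threads, but that is left out here to keep the sketch short.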

Remarks:

Threads.nthreads() → 24
I tried MKL without success (most likely, the compilation on my Windows machine was not working properly)
@inbounds Threads.@threads was slower than ThreadsX.foreach

Oscar_Smith · November 16, 2020, 9:57pm

Do you have a comparison with another system that does this faster? This is very useful so we know what should be possible on your computer, and potentially to see how that piece of software achieved these results. Thanks!

uwechsler · November 16, 2020, 10:16pm

No, I am sorry. I don’t have a comparison where it ran faster.
So it might be that there is no faster implementation on my system.

Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake-avx512)

It was just me being curious and greedy, wondering if there is a way to make this faster.

The only comparison I have is the “matrix-vector” multiplication, which is two times faster and made me wonder whether there is some potential.

Otherwise, I know that the MKL performance on my system is really bad compared to a similar Linux machine. But there I don’t know where to even start solving the issue.
