Improve performance of vector-matrix multiplication

In a larger project, I have to compute a (row-)vector-matrix multiplication. In mathematical terms, I would write it as c = x \cdot M^T, where x \in \mathbb{R}^{1 \times n}, M^T \in \mathbb{R}^{n \times m}, and c \in \mathbb{R}^{1 \times m}. In the implementation, however, x and c are not row vectors or matrices but plain one-dimensional arrays (Vectors).
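Written out componentwise (just to make the indexing explicit), this is

c_j = \sum_{i=1}^{n} x_i (M^T)_{ij} = \sum_{i=1}^{n} x_i M_{ji}, \quad j = 1, \dots, m.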

My current implementation is:

LinearAlgebra.gemv!(c, 'T', Mᵀ, x)
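Here gemv! is the non-exported four-argument helper in LinearAlgebra that computes c = transpose(Mᵀ) * x in place. As far as I can tell, the equivalent call through the public API would be

mul!(c, transpose(Mᵀ), x)  # should dispatch to the same BLAS gemv ('T') path for Float64 arrays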

In my application, the vector-matrix multiplication is done several times.
A minimal example looks like this:

using ThreadsX
using LinearAlgebra
using BenchmarkTools

# Initialize Data
nRows = 50_000
nCols = 15
nVecs = 16
c_vec = [zeros(nRows) for i=1:nVecs]
M_vec = [rand(nRows, nCols) for i=1:nVecs]
Mᵀ_vec = [permutedims(M) for M in M_vec]
x_vec = [rand(nCols) for i=1:nVecs]

function vec_matT_mul(c_vec, x_vec, Mᵀ_vec)
    BLAS.set_num_threads(1)
    ThreadsX.foreach(eachindex(c_vec)) do i
        Mᵀ = Mᵀ_vec[i]
        x = x_vec[i]
        c = c_vec[i]
        # inplace version of c = x ⋅ Mᵀ
        LinearAlgebra.gemv!(c, 'T', Mᵀ, x)
    end
    BLAS.set_num_threads(96)
    return nothing
end
@btime vec_matT_mul($c_vec, $x_vec, $Mᵀ_vec)
#  3.297 ms (128 allocations: 15.27 KiB)

Doing the same thing as a matrix-vector multiplication c = M \cdot x with mul!(c, M, x) takes half as much time. I cannot use that form, however, since another part of the code needs the transposed matrix Mᵀ to properly use @avx (a hand-written alternative that works on Mᵀ directly is sketched after the next benchmark).

function mat_vec_mul(c_vec, x_vec, M_vec)
    BLAS.set_num_threads(1)
    ThreadsX.foreach(eachindex(c_vec)) do i
        M = M_vec[i]
        x = x_vec[i]
        c = c_vec[i]
        # inplace version of c = M ⋅ x
        mul!(c, M, x)
    end
    BLAS.set_num_threads(96)
    return nothing
end
@btime mat_vec_mul($c_vec, $x_vec, $M_vec)
# 1.760 ms (127 allocations: 15.25 KiB)
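Since LoopVectorization is already a dependency (for the @avx part mentioned above), one idea would be to bypass BLAS and write the transposed product as an explicit loop over Mᵀ. This is only a sketch under that assumption; the function name is mine and I have not benchmarked it on this machine:

using LoopVectorization

# Hand-written kernel for c = x ⋅ Mᵀ, iterating directly over the
# nCols × nRows layout of Mᵀ; s accumulates the dot product for column j.
function vec_matT_mul_avx!(c, x, Mᵀ)
    @avx for j in axes(Mᵀ, 2)
        s = zero(eltype(c))
        for i in axes(Mᵀ, 1)
            s += x[i] * Mᵀ[i, j]
        end
        c[j] = s
    end
    return c
end

Such a kernel could then replace the LinearAlgebra.gemv! call inside the ThreadsX.foreach loop, keeping BLAS threading out of the picture entirely.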

Therefore, I was wondering if there is any potential for improvement in the computation of the vector-matrix multiplication or the code in general.

Remarks:

  • Threads.nthreads() → 24
  • I tried MKL without success (most likely, the build on my Windows machine did not work properly)
  • @inbounds Threads.@threads was slower than ThreadsX.foreach (a sketch of the variant I tried follows below)
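For completeness, the Threads.@threads variant was roughly of the following form (reconstructed from memory, so treat the exact code as an assumption):

function vec_matT_mul_threads(c_vec, x_vec, Mᵀ_vec)
    BLAS.set_num_threads(1)
    # Statically scheduled loop over the independent products; on this
    # machine this was slower than the ThreadsX.foreach version above.
    @inbounds Threads.@threads for i in eachindex(c_vec)
        LinearAlgebra.gemv!(c_vec[i], 'T', Mᵀ_vec[i], x_vec[i])
    end
    BLAS.set_num_threads(96)
    return nothing
end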

Do you have a comparison with another system that does this faster? That would be very useful, both to know what should be possible on your computer and, potentially, to see how that piece of software achieved its results. Thanks!


No, I am sorry, I don’t have a comparison where it ran faster.
So it might be that there is no faster implementation on my system.

Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake-avx512)

It was just me being curious and greedy, wondering if there is a way to make this faster :wink:

The only comparison I have is the “matrix-vector” multiplication above, which is twice as fast; that made me wonder whether there is some potential.

Otherwise, I know that the MKL performance on my system is really bad compared to a similar Linux machine, but I don’t even know where to start solving that issue.