In a larger project, I have to compute a (row)vectormatrix multiplication. I guess that in mathematical terms, I would write it as c = x\cdot M^T where x\in \mathbb{R}^{1 \times n}, M^T\in \mathbb{R}^{n \times m} and c\in \mathbb{R}^{1 \times m}. However, in the implementation, x and c are not rowvectors or matrices but arrays.
My current implemenation is:
LinearAlgebra.gemv!(c, 'T', Mᵀ, x)
In my application, the vectormatrix multiplication is done several times.
A minimal example looks like this:
using ThreadsX
using LinearAlgebra
using BenchmarkTools
# Initialize Data
nRows = 50_000
nCols = 15
nVecs = 16
c_vec = [zeros(nRows) for i=1:nVecs]
M_vec = [rand(nRows, nCols) for i=1:nVecs]
Mᵀ_vec = [permutedims(M) for M in M_vec]
x_vec = [rand(nCols) for i=1:nVecs]
function vec_matT_mul(c_vec, x_vec, Mᵀ_vec)
BLAS.set_num_threads(1)
ThreadsX.foreach(eachindex(c_vec)) do i
Mᵀ = Mᵀ_vec[i]
x = x_vec[i]
c = c_vec[i]
# inplace version of c = x ⋅ Mᵀ
LinearAlgebra.gemv!(c, 'T', Mᵀ, x)
end
BLAS.set_num_threads(96)
return nothing
end
@btime vec_matT_mul($c_vec, $x_vec, $Mᵀ_vec)
# 3.297 ms (128 allocations: 15.27 KiB)
Doing the same thing as Matrixvector multiplication c = M \cdot x with mul!(c, M, x)
takes half as much time (which I cannot do since I need the matrix M transposed to make another part of the code faster to properly use @avx
) .
function mat_vec_mul(c_vec, x_vec, M_vec)
BLAS.set_num_threads(1)
ThreadsX.foreach(eachindex(c_vec)) do i
M = M_vec[i]
x = x_vec[i]
c = c_vec[i]
# inplace version of c = M ⋅ x
mul!(c, M, x)
end
BLAS.set_num_threads(96)
return nothing
end
@btime mat_vec_mul($c_vec, $x_vec, $M_vec)
# 1.760 ms (127 allocations: 15.25 KiB)
Therefore, I was wondering if there is any potential for improvement in the computation of the vectormatrix multiplication or the code in general.
Remarks:
Threads.nthreads() → 24
 I tried MKL without success (most likely, the compilation on my Windows machine was not working properly)

@inbounds Threads.@threads
was slower thanThreadsX.foreach