Multiply many-matrices by many-vectors




I have a critical portion of code that essentially involves the multiplication of many pre-computed matrices by many pre-computed vectors. Each matrix multiplies the same, entire list of vectors. Each matrix is the same size (N1xN2) in my current form, but eventually I’d like to remove this requirement, so that each matrix-vector combination can have arbitrary size. In Matlab I might think of this as a cell-array of matrices and a cell array of vectors. But for now I’m just trying to understand performant Julia on this “array of matrices” case.


M(N1,N2,nMatrix) x D(N2,nVector) ==> R(N1,nVector,nMatrix)


## Julia 1.0.2
N1 = 7; # Always
N2 = 8; # Always
nMatrix = 1000; # Usually varies between 1e4 - 1e7, sometimes as high as 1e8
nVector  = 1000; # Usually varies between 1e4 and 2e5
M = rand(N1,N2,nMatrix);
V = rand(N2,nVector);

function Naive(M,V)
    N1 = size(M,1);
    nMatrix = size(M,3);
    nVector = size(V,2);
    R = zeros(N1,nVector ,nMatrix);
    for i = 1:nMatrix   # Iterate through the matrices
        for j = 1:nVector  # Iterate through the vectors
            R[:,j,i] = M[:,:,i] * V[:,j];
    return R

function myBest(M,V)
    N1 = size(M,1);
    nMatrix = size(M,3);
    nVector = size(V,2);
    R = zeros(N1,nVector ,nMatrix);
    for i = 1:nMatrix   # Iterate through the matrices
        R[:,:,i] = M[:,:,i] * V;  # Matrix-Matrix multiplication is equivalent to naive approach
    return R

Benchmarking these:

import BenchmarkTools
BenchmarkTools.@btime Naive(M,V);
  #389.538 ms (3000002 allocations: 846.86 MiB)

BenchmarkTools.@btime myBest(M,V);
  #64.830 ms (3002 allocations: 107.41 MiB)

Which isn’t too shabby; While not quite as fast as single-threaded Matlab 2018b, we’re on the same order of magnitude - so I chalk this up to different math libraries:

Matlab 2018b Comparison

LASTN = maxNumCompThreads(1);
CURRN = maxNumCompThreads();
disp(['Number of Computational Threads = ' num2str(CURRN)])
# Number of Computational Threads = 1

function R = myBest_Matlab(M,V)
    N1 = size(M,1);
    nMatrix = size(M,3);
    nVector = size(V,2);
    R = zeros(N1,nVector,nMatrix);
    for i = 1:nMatrix
        R(:,:,i) = M(:,:,i) * V;

f = @() myBest_Matlab(M,V);
t = timeit(f,1);
disp(['Time Elapsed = ' num2str(t * 1e3) ' ms'])
# Time Elapsed = 49.5614 ms

Are there any thoughts on this? As a long-time Matlab user, that’s the Julia that seems most natural to me… but are there other things I’ve missed… preferably without completely obfuscating my code?

Follow-up question: Parallel-computing
Assuming that I have mostly maxed out performance of Julia, I’ve looked into parallel computations. I have a 12 physical-core workstation (and access to a 72-core node) and it sure would seem like I could parallelize across the matrices. But when I naively try to use multi-threading:

import Base.Threads
function myNaiveThreaded(M,V)
    N1 = size(M,1);
    nMatrix = size(M,3);
    nVector = size(V,2);
    R = zeros(N1,nVector ,nMatrix);
    Threads.@threads for i = 1:nMatrix   # Iterate through the matrices
        R[:,:,i] = M[:,:,i] * V;  # Matrix-Matrix multiplication is equivalent to naive approach
    return R

BenchmarkTools.@btime myNaiveThreaded(M,V);
#34.264 ms (2976 allocations: 106.60 MiB)

And tracking the julia.exe process in task manager shows >14GB RAM utilization (128GB available), plus @btime takes about a minute to run. So not only is the code not “6-threads = ~6x faster” but it’s eating up alot of RAM too. As mentioned in my comments on the first Julia code, the total number of matrices is usually very large (1e4 - 1e8) so I think I’d rather use shared memory so that I don’t have nProc * sizeof(M) from multiple processes with their own copies of the data. Thoughts here?


My first tip would be use @views to avoid allocation in array slicing.

        @views R[:,:,i] = M[:,:,i] * V;

However, I am not sure if this works in Threads.

EDIT: Nah, it is not helping in your example. Sorry.


It would more natural to simply have an array of matrices and an array of vectors for this, rather than using a 3d array, especially if you want the matrices to be different sizes. If you are doing this over and over, you probably want to pre-allocate the output vectors and use mul! to do the matvecs with preallocated output.

For example,

using LinearAlgebra
function doit!(M, V, R)
    size(R) == (length(V),length(M)) || throw(BoundsError())
    Threads.@threads for i = 1:length(M)
        for j = 1:length(V)
            @inbounds mul!(R[j,i], M[i], V[j])
    return R

using BenchmarkTools

M = [rand(7,8) for i=1:1000];
V = [rand(8) for j=1:2000];
R = [zeros(7) for j=1:2000, i=1:1000];
@btime doit!(M, V, R);

On my laptop, I get 138.670 ms (1 allocation: 48 bytes) with JULIA_NUM_THREADS=1. With JULIA_NUM_THREADS=2, I get 73.856 ms (1 allocation: 48 bytes), and with 4 threads I get 39.231 ms (1 allocation: 48 bytes).


You probably want to use the mighty GEMM. Your job can be expressed as a single matrix-matrix multiplication if you use the proper memory layout.

julia> N1 = 7; # Always
julia> N2 = 8; # Always
julia> nMatrix = 1000; # Usually varies between 1e4 - 1e7, sometimes as high as 1e8
julia> nVector  = 1000; # Usually varies between 1e4 and 2e5
julia> V= rand(N2, nVector);
julia> M=rand(N1*nMatrix , N2);
julia> @time res=M*V;
  0.080671 seconds (6 allocations: 53.406 MiB, 77.77% gc time)

compared to:

julia> N1 = 7; # Always
julia> N2 = 8; # Always
julia> nMatrix = 1000; # Usually varies between 1e4 - 1e7, sometimes as high as 1e8
julia> nVector  = 1000; # Usually varies between 1e4 and 2e5
julia> M = rand(N1,N2,nMatrix);
julia> V = rand(N2,nVector);
julia> function myBest(M,V)
           N1 = size(M,1);
           nMatrix = size(M,3);
           nVector = size(V,2);
           R = zeros(N1,nVector ,nMatrix);
           for i = 1:nMatrix   # Iterate through the matrices
               R[:,:,i] = M[:,:,i] * V;  # Matrix-Matrix multiplication is equivalent to naive approach
           return R
julia> @time myBest(M,V);
  0.132145 seconds (3.01 k allocations: 107.407 MiB, 46.73% gc time)

Times are post warm-up.


Eventually he says he wants all of the matrices to be different sizes, in which case this won’t work.


@stevengj Amazing, thank you! Scales almost perfectly to the 10 threads I’ve benchmarked over. I’ve got some more reading/testing to do as mul! (that is to say: in-place operations) is a new way of thinking for me, definitely adding that to the toolbox. Thanks for addressing the matrices of different sizes portion, especially as this is my ultimate goal.

Quick question, your function doit! ends with ! – is that simply naming convention for functions that operate “in-place” or does that specifically tell Julia / LLVM that in-place operations are about to occur?

Links I’ve found & read based on your answer, for the future reader:


Yes, but typically you have the variable being mutated as the first argument.


N1 = 7; # Always
N2 = 8; # Always

Maybe take a look at also.


Sure it will work.

OP wants to apply a bunch of linear operators on a single vector. Take the direct sum of the codomains of all the linear operators you want to apply; then you just need to apply a single large linear operator to a single vector. Now OP has multiple vectors; make it a matrix and it is a general matrix-matrix product.

Perfect exercise for a linear algebra course (universal algebra definition of product), I’m gonna steal this :slight_smile:

That being said, OP only contracts N2=8 dimensions. Specializing the code to this number might be a large advantage, as @antoine-levitt pointed out (fully unroll the N2 loop during matrix product), as long as the cache is well-used (meaning: recursive subdivision / hilbert curve for the order of N2-dimensional dot products).

PS. We even have a function for the cartesian product / direct sum: Your matrix M is vcat(Ms...).
But you probably need to build this matrix by hand (vcat for a splatted giant list of matrices could be ugly).


For reference @foobar_lv2, as it might be useful for your linear algebra course :wink:, this problem comes from FEA. Each matrix represents an element, the N2 = 8 being the number of nodes for a linear hex element. So when I say that N2 = 8 # Always that’s a bit of a lie as one could have generated a mixed-element/order mesh. Using standard Lagrange basis functions, possible values for each matrice’s N2 could be (not exhaustive):

  • N2 = 4 # Linear Tet
  • N2 = 10 # Quadratic Tet
  • N2 = 8 # Linear Hex
  • N2 = 27 # Quadratic Hex
  • N2 = 6 # Linear Wedge
  • N2 = 21 # Quadratic Wedge
  • N2 = 5 # Linear Pyramid
  • N2 = 14 # Quadratic Pyramid

And in a mixed-element/order setting the matrices could either be broken up by their type and do your GEMM solution on each, or operated on as a single list of matrices/vectors as @stevengj suggested.

I’ll certainly give your GEMM suggestion some testing today, but I’m curious as to the performance difference between yours and @stevengj’s solution (when his is fully-threaded). I assume this is because I’d need to do something akin to LinearAlgebra.BLAS.set_num_threads to enable threading for GEMM?


If you are considering local element assembly here, then definitely look at staticArrays


Thanks for the tip. I’ve explored this within a similar framework as the one @stevengj suggested and, while the function’s runtime does improve significantly (~5x) , any improvement is vastly outshadowed by the increased cost to instantiate the StaticArrays prior to the function call. From the StaticArrays site

Note that in the current implementation, working with large StaticArray s puts a lot of stress on the compiler, and becomes slower than Base.Array as the size increases. A very rough rule of thumb is that you should consider using a normal Array for arrays larger than 100 elements.

When applying StaticArrays to an approach inspired by @stevengj:

import StaticArrays

function ThreadedStaticArrays(M,V,R)
    size(R) == (length(V),length(M)) || throw(BoundsError())
    Threads.@threads for i = 1:length(M)
        for j = 1:length(V)
            @inbounds R[j,i] = M[i] * V[j]; # mul! not defined for StaticArrays
    return R

N1 = 7
N2 = 8
nMatrix = 1000
nVector = 1000
M = [StaticArrays.SMatrix{N1,N2}(rand(N1,N2)) for i=1:nMatrix]
V = [StaticArrays.SMatrix{N2,1}(rand(N2)) for j=1:nVector]
R = [StaticArrays.SMatrix{N1,1}(zeros(N1)) for j = 1:nVector, i = 1:nMatrix]
R = ThreadedStaticArrays(M,V,R);

Benchmarking the last two lines:

BenchmarkTools.@btime R = [StaticArrays.SMatrix{N1,1}(zeros(N1)) for j = 1:nVector, i = 1:nMatrix]
# 3.457 s
BenchmarkTools.@btime R = ThreadedStaticArrays(M,V,R);
# 1.612 ms

As the nVectors and nMatrices increase, the second to last line quickly becomes intractable.


This is indeed very slow, but not for the reasons you listed. The quote you pulled from the StaticArrays readme refers to the compilation time which isn’t relevant here.

Instead, that code is slow because it calls zeros(N1) 1,000,000 times, which allocates 1,000,000 Vector{Float64}s, each of which is immediately used to construct an SMatrix and then discarded. That’s a huge amount of unnecessary work.

Instead, you can rely on the fact that StaticArrays provides a helpful method for zero() when called on an SVector or SMatrix. That in turn means that you can directly construct a zeros matrix where each element is itself a static array:

julia> @btime zeros(SVector{$N1, Float64}, $nVector, $nMatrix)
  14.826 ms (18 allocations: 53.41 MiB)


And, by the way, constructing a matrix of SVectors in this way is almost exactly as fast as just constructing the 3D array of the same total size:

julia> @btime zeros($N1, $nVector, $nMatrix)
  14.093 ms (2 allocations: 53.41 MiB)

And that makes sense, because the memory layout of those two data structures is identical; the only difference is how that memory is interpreted.


This is related to batched operations. I’m actually working on something related, check these two work-in-progress package:

This package is trying to provide a family of types for batched arrays, this is extremely useful when you want to use batched operations in other libraries, but you don’t write any extra code. (e.g It will just work with those AD packages)

This package is trying to implement batched routines for both CPU and GPU

Unfortunately, I’m still working on this slowly, so you might not be about to use them yet (it’s not stable and haven’t cover much functions in BLAS/LAPACK). But whoever interested in this, please file issues.

I’m currently considering to implement a general batch broadcaster which is similar to builtin broadcast, but it will only broadcast on those batch dimension. But I’m quite busy recently, not sure when I’ll have more time on this.

Tips & Conclusion

As shown below, my implementation is about 4x faster than your best results (I don’t have MATLAB 2018b, but I guess this will be faster than MATLAB), here are some tips:

  1. use rank-3 tensor instead of an array of matrix/vector, why? in this way we can assume they are contiguous on physical memory and get rid of the views / copys. (I’m not sure cache-efficency is important for batch since each matrix’s memory address is pretty far to each other)

  2. be careful when use @thread, BLAS will spawn threads via openmpas well, there no need to use thread here. (but necessary when you have small matrix)

Implementation and Benchmark Results

(to be fair, some of your implementation is re-written as in-place methods)

The following is a quite straight forward implementation and I compare it with methods above:

for (gemm, elty) in
    @eval begin
        function batched_gemm_constant_rhs!(transA::AbstractChar, transB::AbstractChar, alpha::($elty), A::AbstractArray{$elty, 3}, B::AbstractArray{$elty, 2}, beta::($elty), C::AbstractArray{$elty, 3})
            @assert !BLAS.has_offset_axes(A, B, C)
            @assert size(A, 3) == size(C, 3) "batch size mismatch"
            m = size(A, transA == 'N' ? 1 : 2)
            ka = size(A, transA == 'N' ? 2 : 1)
            kb = size(B, transB == 'N' ? 1 : 2)
            n = size(B, transB == 'N' ? 2 : 1)
            if ka != kb || m != size(C,1) || n != size(C,2)
                throw(DimensionMismatch("A has size ($m,$ka), B has size ($kb,$n), C has size $(size(C))"))

            ptrA = Base.unsafe_convert(Ptr{$elty}, A)
            ptrB = Base.unsafe_convert(Ptr{$elty}, B)
            ptrC = Base.unsafe_convert(Ptr{$elty}, C)

            for k in 1:size(A, 3)
                ccall((BLAS.@blasfunc($gemm), BLAS.libblas), Cvoid,
                    (Ref{UInt8}, Ref{UInt8}, Ref{BLAS.BlasInt}, Ref{BLAS.BlasInt},
                     Ref{BLAS.BlasInt}, Ref{$elty}, Ptr{$elty}, Ref{BLAS.BlasInt},
                     Ptr{$elty}, Ref{BLAS.BlasInt}, Ref{$elty}, Ptr{$elty},
                     transA, transB, m, n,
                     ka, alpha, ptrA, max(1,stride(A,2)),
                     ptrB, max(1,stride(B,2)), beta, ptrC,

                ptrA += size(A, 1) * size(A, 2) * sizeof($elty)
                ptrC += size(C, 1) * size(C, 2) * sizeof($elty)

        function batched_gemm_constant_rhs(transA::AbstractChar, transB::AbstractChar, alpha::($elty), A::AbstractArray{$elty, 3}, B::AbstractArray{$elty, 3})
            batched_gemm_constant_rhs!(transA, transB, alpha, A, B, zero($elty), similar(B, $elty, (size(A, transA == 'N' ? 1 : 2), size(B, transB == 'N' ? 2 : 1), size(B, 3))))
        function batched_gemm_constant_rhs(transA::AbstractChar, transB::AbstractChar, A::AbstractArray{$elty, 3}, B::AbstractArray{$elty, 3})
            batched_gemm_constant_rhs(transA, transB, one($elty), A, B)


julia> @benchmark batched_gemm_constant_rhs!('N', 'N', 1.0, $(rand(7, 8, 1000)), $(rand(8, 1000)), 0.0, $(zeros(7, 1000, 1000)))
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     5.068 ms (0.00% GC)
  median time:      5.722 ms (0.00% GC)
  mean time:        5.776 ms (0.00% GC)
  maximum time:     13.589 ms (0.00% GC)
  samples:          864
  evals/sample:     1

And comparing to other implementation:


# use last dimension as batch dimension
batch_size(X::AbstractArray{T, 3}) where T = size(X, 3)
batch_size(X::AbstractArray{T, 2}) where T = size(X, 2)

function naive_gemv!(R::AbstractArray{T, 3}, M::AbstractArray{T, 3}, V::AbstractArray{T, 2}) where T
    n = batch_size(M)
    m = batch_size(V)
    for k in 1:n, l in 1:m
        R[:, l, k] = M[:, :, k] * V[:, l]
    return R

function greg_best_gemv!(R::AbstractArray{T, 3}, M::AbstractArray{T, 3}, V::AbstractArray{T, 2}) where T
    n = batch_size(M)
    for i in 1:n
        R[:, :, i] = M[:, :, i] * V
    return R

Benchmark Results:


julia> @benchmark naive_gemv!($(zeros(7, 1000, 1000)), $(rand(7, 8, 1000)), $(rand(8, 1000)))
  memory estimate:  793.46 MiB
  allocs estimate:  3000000
  minimum time:     303.710 ms (0.00% GC)
  median time:      315.847 ms (14.86% GC)
  mean time:        368.483 ms (23.48% GC)
  maximum time:     720.253 ms (56.14% GC)
  samples:          14
  evals/sample:     1

your best:

julia> @benchmark greg_best_gemv!($(zeros(7, 1000, 1000)), $(rand(7, 8, 1000)), $(rand(8, 1000)))
  memory estimate:  54.00 MiB
  allocs estimate:  3000
  minimum time:     21.033 ms (21.01% GC)
  median time:      27.660 ms (15.55% GC)
  mean time:        35.622 ms (21.81% GC)
  maximum time:     361.027 ms (90.67% GC)
  samples:          141
  evals/sample:     1

@stevengj 's:

function stevengj_gemm!(R, M, V)
    size(R) == (length(V),length(M)) || throw(BoundsError())
    Threads.@threads for i = 1:length(M)
            for j = 1:length(V)
                @inbounds mul!(R[j,i], M[i], V[j])
    return R


M = [rand(7, 8) for i in 1:1000]
V = [rand(8) for j in 1:1000]
R = [zeros(7) for j in 1:1000, i in 1:1000]


julia> Threads.nthreads()

julia> @benchmark stevengj_gemm!(R, M, V)
  memory estimate:  48 bytes
  allocs estimate:  1
  minimum time:     24.061 ms (0.00% GC)
  median time:      29.095 ms (0.00% GC)
  mean time:        30.410 ms (0.00% GC)
  maximum time:     45.613 ms (0.00% GC)
  samples:          165
  evals/sample:     1

Hope this helps.


BTW, keep in mind that an Nx1 Matrix is not the same as a length N Vector. The difference may be important in some contexts.


Actually, the contents of R is simply overwritten, so there is no need to zero it out. Instead, we can do

R = Matrix{SVector{N1, Float64}}(undef, nVector, nMatrix)


julia> @btime zeros(SVector{$N1, Float64}, $nVector, $nMatrix);
  10.697 ms (18 allocations: 53.41 MiB)

julia> @btime Matrix{SVector{$N1, Float64}}(undef, $nVector, $nMatrix);
  265.654 μs (18 allocations: 53.41 MiB)

This is consistent with normal array construction:

julia> @btime zeros($N1, $nVector, $nMatrix);
  8.963 ms (2 allocations: 53.41 MiB)

julia> @btime Array{Float64,3}(undef, $N1, $nVector, $nMatrix);
  272.542 μs (2 allocations: 53.41 MiB)

Compared to the original code:

jl> @btime [StaticArrays.SMatrix{$N1,1}(zeros($N1)) for j = 1:$n
Vector, i = 1:$nMatrix];
  3.004 s (33000005 allocations: 2.12 GiB)

that’s more than 11,000 times speed-up(!)


@Roger-luo your solution is great. I guess a MKL solution would be even faster.
Unfortunately for most users like me creating that code is quite difficult. It would be great that Julia automatically made possible a naive code was almost as fast as yours.


I’m using MKL on my Mac. Yeah, I know this is quite complicated.

That’s why I’m writing those two packages. Hopefully, at some point, people can broadcast across batch dimensions just like what they do with normal broadcast in Julia, which will be super simple.


It’s a bit unclear to me what ‘batched dimensions’ are, or even batched operations, and so on. I looked at your packages, but still don’t understand. Could you briefly explain?