Multiply many-matrices by many-vectors

I guess this convention came from deep learning frameworks. E.g. PyTorch has some batched operations, like bmm, which performs matrix multiplication on A and B, where A and B are batches of matrices represented as rank-3 tensors (in PyTorch the first dimension is treated as the batch dimension).

A batched operation here means performing the same operation on a batch of objects (matrices, vectors, etc.). This is extremely useful when you do parallel computing or write GPU code, since the batch dimension can be easily parallelized.

E.g. for batched matrix multiplication, cuBLAS has the gemmBatched routine for arrays of matrix pointers, and gemmStridedBatched for rank-3 tensors.
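To make the two layouts concrete, here is a plain-Julia sketch (sizes are just for illustration):

    A_batched = [rand(7, 8) for _ in 1:1000]  # "vector of matrices", the layout gemmBatched expects
    A_strided = rand(7, 8, 1000)              # rank-3 tensor with a fixed stride between matrices,
                                              # the layout gemmStridedBatched expects
    view(A_strided, :, :, 3)                  # the 3rd matrix in the strided layout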


I’ll just mention that you will also get speed-ups by replacing this with

using StaticArrays

M = rand(SMatrix{N1,N2,Float64,N1*N2}, nMatrix)  # vector of nMatrix random N1×N2 static matrices
V = rand(SVector{N2,Float64}, nVector)           # vector of nVector random length-N2 static vectors
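A minimal usage sketch, assuming the goal is to multiply every matrix with every vector (the same contraction as the @tensor example further down):

    res = [m * v for m in M, v in V]  # nMatrix × nVector array of SVector{N1,Float64}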

This is indeed very slow, but not for the reasons you listed. The quote you pulled from the StaticArrays README refers to compilation time, which isn’t relevant here.

@rdeits Thanks for pointing this out.

Instead, that code is slow because it calls zeros(N1) 1,000,000 times, which allocates 1,000,000 Vector{Float64}s, each of which is immediately used to construct an SMatrix and then discarded.
To avoid this, you can rely on the fact that StaticArrays provides a helpful zero() method for SVector and SMatrix. That in turn means you can directly construct an array of zeros in which each element is itself a static array.
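A minimal sketch of the difference, using SVector for brevity (the same applies to SMatrix; sizes are just for illustration):

    using StaticArrays

    # Slow: zeros(8) heap-allocates a temporary Vector{Float64} for every element
    V_slow = [SVector{8,Float64}(zeros(8)) for _ in 1:1_000_000]

    # Fast: zeros(::Type{<:StaticArray}, n) fills the array with the static zero(),
    # so no per-element temporary is allocated
    V_fast = zeros(SVector{8,Float64}, 1_000_000)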

That would seem like a good discussion to capture in the Quick Start for StaticArrays or in this error message:

using StaticArrays
N1 = 1000;
N2 = 1000;
A = rand(N1,N2);
StaticArrays.SMatrix(A)
ERROR: The size of type `StaticArrays.SArray{Tuple{S1,S2},T,2,L} where L where T where S2 where S1` is not known.

If you were trying to construct (or `convert` to) a `StaticArray` you
may need to add the size explicitly as a type parameter so its size is
inferrable to the Julia compiler (or performance would be terrible). For
example, you might try

    m = zeros(3,3)
    SMatrix(m)      # this error
    SMatrix{3,3}(m) # correct - size is inferrable

Essentially, I saw the # correct comment and then “plug’n played” the suggestion. It seems it might be useful for that error to instead print:

    m = zeros(3,3)
    SMatrix(m)      # This error
    SMatrix{3,3}(m) # Syntactically correct statement
    zeros(SMatrix{3,3,Float64}) # Correct method for instantiating an array of inferrable size

If all your matrices are the same size, and you do store them in a rank-3 tensor, you can do

N1 = 7; # Always
N2 = 8; # Always
nMatrix = 1000; # Usually varies between 1e4 - 1e7, sometimes as high as 1e8
nVector  = 1000; # Usually varies between 1e4 and 2e5
M = rand(N1,N2,nMatrix);
V = rand(N2,nVector);

using TensorOperations
@tensor res[a,b,c] := M[a,x,b]*V[x,c]

e.g.

using BenchmarkTools
@btime (@tensor res[a,b,c] := $M[a,x,b]*$V[x,c]);
  #-> 14.611 ms (14 allocations: 53.83 MiB)

with the current master version of TensorOperations.jl (which will soon be v1.0)
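For reference, the contraction res[a,b,c] := M[a,x,b]*V[x,c] is equivalent to the explicit loop below (a slow sketch, shown only to make the index notation concrete):

    res_loop = zeros(N1, nMatrix, nVector)
    for c in 1:nVector, b in 1:nMatrix, a in 1:N1
        acc = 0.0
        for x in 1:N2
            acc += M[a, x, b] * V[x, c]
        end
        res_loop[a, b, c] = acc
    end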


What if we merge @Roger-luo’s and @juthohaegeman’s approaches? (Using a rank-3 tensor with BLAS.)

I want to thank everyone for the time you’ve spent answering this question; it’s greatly appreciated and I think I’ve learned quite a bit. Admittedly, part of the battle I (and likely many others trying to convert) face is identifying the useful Julia packages and features, so I especially thank those of you who pointed me to some of these mature packages (e.g. StaticArrays) and some to watch in the future (e.g. BatchedRoutines and TensorOperations).

What I want to do now is help summarize a few of my findings for the future reader. I’ve compiled some of these solutions into a gist on my GitHub account and ran the benchmarks on my 12-CPU machine. I hope to update this comment as I eventually get around to trying out the other solutions provided in this thread.


Are you taking into account the number of threads BLAS is using?

Just prior to the BLAS section I run LinearAlgebra.BLAS.set_num_threads(Threads.nthreads()), which should set the number of threads the BLAS library uses to the value of the JULIA_NUM_THREADS environment variable I’ve set for the current benchmark.
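For concreteness, that setup step is just:

    using LinearAlgebra

    # Match the BLAS thread count to Julia's thread count
    # (JULIA_NUM_THREADS must already be set when Julia is started)
    LinearAlgebra.BLAS.set_num_threads(Threads.nthreads())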

Yeah, having a frontend like TensorOperations would be super nice; it makes it a lot easier to write complex expressions. However, batched operations are not exactly the same as einsum notation. E.g.

for batched_gemm(A, B, C), where A, B, and C are batches of matrices, you only need to contract the 2nd dimension of A with the 1st dimension of B, and you do this size(A, 3) (= size(B, 3) = size(C, 3)) times. We don’t have a notation for such a batch dimension in einsum at the moment.
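A CPU sketch of that semantics as an explicit loop (batched_gemm! here is a hypothetical helper name; real implementations dispatch to BLAS gemm! per slice, or to cuBLAS’s strided-batched routine on GPU):

    using LinearAlgebra

    function batched_gemm!(C::AbstractArray{T,3}, A::AbstractArray{T,3}, B::AbstractArray{T,3}) where {T}
        @assert size(A, 3) == size(B, 3) == size(C, 3)
        for k in axes(A, 3)
            # contract the 2nd dim of A[:, :, k] with the 1st dim of B[:, :, k]
            mul!(view(C, :, :, k), view(A, :, :, k), view(B, :, :, k))
        end
        return C
    end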

I’m not sure if @juthohaegeman has any ideas on this?

I treated this problem as a special case (the RHS is a vector of copies of the same vector); this happens quite often when you train a neural network on a set of data (you feed the data as small batches). Normally, each element will be different (they usually come from a large dataset, actually).

And hopefully, I could broadcast functions built on TensorOperations over the batch dimension as well.
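A rough sketch of what “broadcasting over the batch dimension” could mean for a rank-3 tensor (mapslices is used only to show the semantics; a real batched implementation would avoid the per-slice overhead):

    f(m) = m * m'                       # any per-matrix function
    A = rand(7, 7, 1000)                # dims 1 and 2 are the matrix, dim 3 is the batch
    B = mapslices(f, A; dims = (1, 2))  # apply f to every matrix in the batch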

Have you seen JAX’s vmap? GitHub - google/jax: Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more

Yeah, thanks for mentioning this. That’s what I wanted to do. Actually, inside their implementation there is a function called broadcast_batch, which I might use as the API, like broadcast in Base :wink:


Can TensorOperations be used to write GPU kernels without messing with low-level threads? Here’s an example of that: https://research.fb.com/announcing-tensor-comprehensions/

@juthohaegeman

I don’t think it works with GPU kernels, at least not in the current stable version. I didn’t find any kernel fusion or kernel generation in the implementation, or maybe I missed something. TVM/TC-like features have been discussed several times in the Julia community; it would be a very interesting approach to implement in Julia, and it might be easier to implement there.

Currently, no. The new version (which will be v1.0) is radically different (under the hood, not in the API) from the previous versions (v0.x) in that TensorOperations.jl will mostly be about the API, i.e. the @tensor macro, and then the high-level implementation thereof for strided arrays.

Before, there were also very low-level implementations for adding two tensors in a permuted way, i.e. generalizing (and speeding up) Julia’s permutedims! function. Now, all of this has been moved out of TensorOperations.jl and lives in a package called Strided.jl, which can do much more general things.

So TensorOperations.jl just provides an implementation that uses BLAS mul! whenever possible, and the kernels from Strided.jl for everything else.
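As a rough illustration of the Strided.jl side (a sketch based on its README; check the package docs for exact usage):

    using Strided

    A = rand(1000, 1000)
    B = similar(A)

    # a permuted-addition kernel, B = (A + Aᵀ)/2, evaluated with Strided's blocked,
    # multithreaded kernels instead of Base broadcasting
    @strided B .= (A .+ A') ./ 2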

My future goal is to make Strided.jl also have dedicated kernels for GPU arrays. This, however, requires that I first have access to a GPU (I have ordered some), that I learn to work with a GPU, and that I can then write a low-level kernel for the GPU. But once we’re there, TensorOperations.jl should also work with GPUs out of the box.

A more rapid approach might be to just implement the necessary methods (add!, trace! and contract!) for GPUArrays using the primitives that are already available, although last time I checked the permutedims! method had a very simplistic implementation for GPUArrays.
