Optimization Based on Intel MKL Matrix Multiplication Batch Mode



Intel MKL has a batch mode for Matrix Multiplication (See Introducing Batch GEMM Operations).

It seems to have great potential for speeding up small matrix multiplication operations in at least two ways (that I can see, probably more):

  1. Broadcasting
    When multiplying a 2D array with a 3D array, broadcasting the matrix multiplication operation along the 3rd dimension.
  2. Lazy Evaluation
    When many small matrix multiplications are called in a loop, they could be accumulated and sent as one batch job to MKL's batch mode.
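To make the first pattern concrete, here is a minimal Julia sketch of what "broadcasting" a matrix multiplication over the 3rd dimension looks like today: a loop of small GEMM calls, exactly the shape of workload batch GEMM could fuse into one call. The sizes here are illustrative only.

```julia
using LinearAlgebra

# A 2×2 matrix multiplied against every "page" of a 2×2×2 array.
A = [1.0 2.0; 3.0 4.0]
B = cat([1.0 0.0; 0.0 1.0], [2.0 0.0; 0.0 2.0]; dims = 3)

# Today: one small GEMM per page; batch mode could submit all at once.
C = similar(B)
for k in axes(B, 3)
    mul!(view(C, :, :, k), A, view(B, :, :, k))
end
```

Each iteration is an independent multiplication, which is what makes the pattern a natural fit for a batched API.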

Is this implemented?
If not, could it be added in 1.x (of course not 1.0, but during its optimization phase once it is released)?


This seems to be basically a thin wrapper around threading. It would be better served by good support for easy threading in Julia than by special-casing it at the MKL level.
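For illustration, here is a minimal sketch of what "batching at the Julia level" might look like: threading over many independent small multiplications with `Threads.@threads` instead of calling a C batch API. The sizes and batch count are made up for the example.

```julia
using LinearAlgebra

# Avoid oversubscription: let Julia's threads do the parallelism,
# keeping each individual BLAS call single-threaded.
BLAS.set_num_threads(1)

n, batch = 8, 64
As = [rand(n, n) for _ in 1:batch]
Bs = [rand(n, n) for _ in 1:batch]
Cs = [zeros(n, n) for _ in 1:batch]

Threads.@threads for i in 1:batch
    mul!(Cs[i], As[i], Bs[i])   # each task runs one small GEMM
end
```

Whether this matches MKL's batch mode in practice would need benchmarking; the point is only that the parallel structure is expressible in plain Julia.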


Indeed, if the whole magic is just multithreading, it should be done at the Julia level.
Though calling one C function instead of many might have less overhead, no?

Unless small matrices end up handled by JuliaBLAS, in which case an engine that multithreads those operations as described above would be the best choice.


I don’t think the overhead of calling C is significant. And if you’re doing very small matrices, you’re probably better off with StaticArrays anyway, which bypasses BLAS.
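To show what bypassing BLAS looks like, here is a small StaticArrays example (illustrative values; assumes StaticArrays.jl is installed). With sizes known at compile time, the multiplication is unrolled and stack-allocated, so no BLAS call happens at all.

```julia
using StaticArrays

# Fixed 2×2 sizes: the product is computed with unrolled, inlined code.
A = @SMatrix [1.0 2.0; 3.0 4.0]
B = @SMatrix [5.0 6.0; 7.0 8.0]
C = A * B   # no BLAS, no heap allocation
```

For matrices this small, the dispatch and call overhead of any BLAS (batched or not) tends to dominate, which is the usual argument for StaticArrays here.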


TensorOperations.jl maps some tensor operations to BLAS calls already, so you can take advantage of MKL.
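A minimal example of the kind of contraction TensorOperations.jl lowers to BLAS (assuming TensorOperations.jl is installed; when Julia is linked against MKL, the GEMM runs on MKL):

```julia
using TensorOperations

A = rand(4, 5)
B = rand(5, 6)

# Contraction over the shared index j is dispatched to a GEMM call.
@tensor C[i, k] := A[i, j] * B[j, k]
```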