Intel MKL has a batch mode for matrix multiplication (see Introducing Batch GEMM Operations).

It seems to have great potential for speeding up small matrix multiplications in at least two ways (the ones I can see; there are probably more):

- Broadcasting

When multiplying a 2D array by a 3D array, the matrix multiplication is broadcast along the 3rd dimension.

- Lazy Evaluation

When many small matrix operations are called in a loop, they are accumulated and then sent as a single batch job to the batch mode in MKL.
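To make the two cases concrete, here is a minimal pure-Python sketch of the semantics I have in mind. It does not call MKL. The names `gemm_batch`, `broadcast_matmul`, and `GemmQueue` are hypothetical and only for illustration; in a real implementation the `gemm_batch` stand-in would presumably be replaced by a single call into MKL's batch GEMM routine (e.g. `cblas_dgemm_batch`).

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain reference product of two small dense matrices."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def gemm_batch(lefts: List[Matrix], rights: List[Matrix]) -> List[Matrix]:
    """Stand-in for ONE batched call: computes lefts[i] * rights[i] for every i.

    Assumption: a real backend would hand the whole batch to the library at once.
    """
    assert len(lefts) == len(rights), "batch sizes must match"
    return [matmul(a, b) for a, b in zip(lefts, rights)]


def broadcast_matmul(a: Matrix, stack: List[Matrix]) -> List[Matrix]:
    """Case 1 (broadcasting): multiply one 2D matrix by every slice of a 3D stack."""
    return gemm_batch([a] * len(stack), stack)


class GemmQueue:
    """Case 2 (lazy evaluation): record products in a loop, run them as one batch."""

    def __init__(self) -> None:
        self._lefts: List[Matrix] = []
        self._rights: List[Matrix] = []

    def push(self, a: Matrix, b: Matrix) -> int:
        """Queue the product a * b and return its index in the flushed results."""
        self._lefts.append(a)
        self._rights.append(b)
        return len(self._lefts) - 1

    def flush(self) -> List[Matrix]:
        """Evaluate all queued products with a single batched call, then reset."""
        results = gemm_batch(self._lefts, self._rights)
        self._lefts, self._rights = [], []
        return results
```

With this sketch, `broadcast_matmul(a, stack)` covers the first case, and calling `push` inside a loop followed by one `flush` covers the second. Whether the deferred approach pays off would depend on how large the batch gets before the results are needed.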

Is this implemented?

If not, could it be added in 1.x (of course not in 1.0, but during its optimization phase once it is released)?