I would add that Batch Mode also enables a nice optimization: broadcasting Matrix Multiplication over the third dimension. (A more advanced use case would be JIT-compiling a Matrix Multiplication that happens in a loop.)
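To make the idea concrete, here is a minimal sketch of multiplying over the third dimension, once as an explicit loop and once as a single batched call. It uses NumPy only as a stand-in: the function names `matmul_loop` and `matmul_batched` are made up for illustration, and whether `np.matmul` ends up in MKL's batched routines depends on how NumPy was built, so treat this as a picture of the semantics rather than a benchmark.

```python
import numpy as np

def matmul_loop(A, B):
    """Multiply page by page: C[:, :, k] = A[:, :, k] @ B[:, :, k]."""
    m, _, n = A.shape
    p = B.shape[1]
    C = np.empty((m, p, n))
    for k in range(n):
        C[:, :, k] = A[:, :, k] @ B[:, :, k]
    return C

def matmul_batched(A, B):
    """Same result with one batched call instead of a loop."""
    # np.matmul broadcasts over leading axes, so move the page axis to the front.
    A_b = np.moveaxis(A, 2, 0)
    B_b = np.moveaxis(B, 2, 0)
    # Move the page axis back to the third dimension afterwards.
    return np.moveaxis(np.matmul(A_b, B_b), 0, 2)

# Hypothetical sizes: five pages of (4 x 3) times (3 x 2) products.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 5))
B = rng.standard_normal((3, 2, 5))

C_loop = matmul_loop(A, B)
C_batched = matmul_batched(A, B)
```

Both functions return an array of shape (4, 2, 5). The batched version hands the whole stack to the library at once, which is the place where a batch-mode BLAS could help.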
Maybe MATLAB gets better performance out of MKL because of how it is configured.
It seems you need to configure it correctly and pay attention to small details (I'm not sure this is the case here, but it is worth checking the integration).
You are basically talking about master at this point. Not only is 1.2 not released yet, even what will be in it is not fixed yet (technically, that is; of course it is possible to have a pretty good idea).