MKL has some interesting new functionality in its Inspector-executor API. You give it an estimate of how many times you will perform an operation, and it optimizes that operation for the given matrix.
Out of curiosity, is there any reason the compiler doesn't just treat y .= A * x as mul!(y, A, x)? Is there some case in which the user wouldn't want the same thing to happen, i.e. for the elements of y to reflect A*x?
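To make the question concrete, here is a minimal sketch comparing the two forms on a small dense matrix. The helper names `broadcast_assign!` and `inplace_mul!` are made up for illustration, and the sizes are arbitrary. As far as I understand it, `y .= A * x` first evaluates `A * x` into a temporary vector and then copies it into `y`, whereas `mul!(y, A, x)` writes the product straight into `y`.

```julia
using LinearAlgebra

# Hypothetical helpers, only here so the allocation measurement runs inside functions.
broadcast_assign!(y, A, x) = (y .= A * x; y)   # A * x builds a temporary, then copies into y
inplace_mul!(y, A, x) = mul!(y, A, x)          # writes the product directly into y

A = rand(100, 100)
x = rand(100)
y1 = zeros(100)
y2 = zeros(100)

# Warm-up calls so compilation is not counted in the measurements below.
broadcast_assign!(y1, A, x)
inplace_mul!(y2, A, x)

alloc_broadcast = @allocated broadcast_assign!(y1, A, x)
alloc_mul = @allocated inplace_mul!(y2, A, x)

println("y .= A * x    allocated bytes: ", alloc_broadcast)
println("mul!(y, A, x) allocated bytes: ", alloc_mul)
println("same result: ", y1 ≈ y2)
```

Both calls leave the same values in `y`; the difference is the temporary. One case where the two are not interchangeable is aliasing: `x .= A * x` is fine because the product is computed before the copy, while the `mul!` documentation says the output must not alias either input.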