How could there be such a huge performance deficit in Julia in matrix addition?
That is an embarrassingly parallel problem. All one has to be careful about is memory access. Is it even possible that the algorithms in MKL and BLAS could be different, since the matrix can be handled as a vector?
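
To make the "matrix as a vector" point concrete, here is a minimal sketch in plain Python. It is not Julia, MKL, or BLAS code, and the names `add_flat` and `add_matrix` are made up for illustration. It assumes both matrices are stored contiguously in column-major order (as Julia arrays are), so adding them element by element is the same as adding two flat buffers of length `rows * cols`:

```python
from array import array


def add_flat(a, b):
    """Add two equal-length contiguous buffers in one linear pass."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    return array("d", (x + y for x, y in zip(a, b)))


def add_matrix(a, b, rows, cols):
    """Same sum with explicit (i, j) indexing into column-major storage.

    The inner loop runs over i, so consecutive iterations touch
    consecutive memory locations (k increases by 1 each step).
    """
    if len(a) != rows * cols or len(b) != rows * cols:
        raise ValueError("buffer length must equal rows * cols")
    out = array("d", [0.0]) * (rows * cols)
    for j in range(cols):
        for i in range(rows):
            k = i + j * rows  # column-major linear index (0-based)
            out[k] = a[k] + b[k]
    return out


# Hypothetical 2x3 example data, stored column by column.
a = array("d", [1, 2, 3, 4, 5, 6])
b = array("d", [10, 20, 30, 40, 50, 60])
assert add_flat(a, b) == add_matrix(a, b, 2, 3)
```

Both functions visit the same memory in the same order, so at this level there is little room for the algorithm itself to differ. Any real gap would more likely come from things like allocation of the result, threading, or vectorization, which this sketch does not measure.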