We’re still a factor of 1000 away from the NumPy version, which can hardly be explained by the difference in BLAS.
Notice that the only BLAS call in my Julia version should be the mul! for the matrix multiply, and since the matrices are tiny, there’s not going to be much of a difference between OpenBLAS and MKL.
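
To check how much a single in-place multiply costs at this scale, something like the following rough sketch could be used. The matrix size `n = 4`, the repetition count, and the helper `time_mul!` are my own assumptions for illustration, not taken from the code in this thread, and the timing loop is crude (BenchmarkTools would give more reliable numbers).

```julia
using LinearAlgebra

# Hypothetical size: the thread only says the matrices are "tiny".
n = 4
A = rand(n, n)
B = rand(n, n)
C = zeros(n, n)

# In-place multiply: writes A*B into C without allocating a new matrix.
mul!(C, A, B)

# Hypothetical helper: average wall-clock time of one mul! call over `reps` calls.
function time_mul!(C, A, B, reps)
    t = @elapsed for _ in 1:reps
        mul!(C, A, B)
    end
    return t / reps
end

time_mul!(C, A, B, 10)                   # warm-up so compilation is not timed
per_call = time_mul!(C, A, B, 100_000)
println("approx. seconds per mul!: ", per_call)

# Shows which BLAS library is currently loaded (Julia 1.7 and later).
println(BLAS.get_config())
```

If the per-call time is small compared to the total runtime, the remaining gap has to come from somewhere other than the BLAS backend.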