I am wondering how einsum/TensorOperations from Julia perform in comparison.
From the Julia Micro-Benchmarks
it seems that matrix multiplication performance in C and Python is similar, which is quite different from the above StackOverflow link. Hence, I would like to see a more comprehensive benchmark, especially one comparing against calling BLAS directly from C/C++/Fortran and Julia. Something like the sketch below is what I have in mind.
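Just a minimal sketch of the comparison on the Julia side, assuming TensorOperations.jl and BenchmarkTools.jl are installed (the size is an arbitrary placeholder):

```julia
using LinearAlgebra, BenchmarkTools, TensorOperations

n = 512                                      # placeholder size
A, B = rand(n, n), rand(n, n)
C = zeros(n, n)

# Direct BLAS call: dgemm under the hood, writing into C in place.
blas_mul!(C, A, B) = mul!(C, A, B)

# TensorOperations index notation; for a plain matrix product this
# should lower to the same BLAS call, so any gap is framework overhead.
tensor_mul!(C, A, B) = @tensor C[i, j] = A[i, k] * B[k, j]

@btime blas_mul!($C, $A, $B)
@btime tensor_mul!($C, $A, $B)
```

The C/C++/Fortran side would then call the same BLAS library's gemm directly on identical sizes.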
Both Julia and NumPy call BLAS, so as long as they use the same BLAS library, you should not see any difference. NumPy might use MKL by default, whereas Julia defaults to OpenBLAS (MKL.jl is available, though).
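For reference, you can check which backend Julia is using and swap it at runtime; a minimal sketch, assuming Julia ≥ 1.7 (where libblastrampoline makes the backend swappable) and MKL.jl installed:

```julia
using LinearAlgebra
BLAS.get_config()   # shows the loaded BLAS library (OpenBLAS by default)

using MKL           # loading MKL.jl swaps the BLAS backend to MKL
BLAS.get_config()   # should now report MKL
```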
In my limited experience, NumPy-MKL/opt-einsum with optimize=True is still a factor of 3~10 slower compared with MKL BLAS called directly from C++. That may be because I am comparing small-scale cases, with contraction times of around 1e-5~1e-2 s.
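In case it helps isolate the effect, here is a rough Julia sketch of the small-scale regime I mean, assuming BenchmarkTools.jl and TensorOperations.jl; at these sizes the fixed per-call overhead (dispatch, temporaries, contraction-path optimization) can easily dominate the BLAS kernel itself:

```julia
using LinearAlgebra, BenchmarkTools, TensorOperations

contract!(C, A, B) = @tensor C[i, j] = A[i, k] * B[k, j]

# Small sizes where per-call overhead rivals the BLAS kernel time.
for n in (8, 32, 128)
    A, B, C = rand(n, n), rand(n, n), zeros(n, n)
    println("n = $n")
    @btime mul!($C, $A, $B)        # direct BLAS gemm
    @btime contract!($C, $A, $B)   # TensorOperations path
end
```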