I saw a benchmark comparing NumPy with BLAS called from C++ for matrix-matrix multiplication (a special case of tensor contraction), and it seems NumPy can be slower than C++ by 1-2 orders of magnitude.
I am wondering how einsum and TensorOperations from Julia compare.
From the Julia Micro-Benchmarks, it seems matrix multiplication performance in C and Python is similar, which is quite different from the above StackOverflow link. Hence, I would like to see a more comprehensive benchmark, especially one comparing Julia against BLAS used directly from C/C++/Fortran.
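On the Julia side, the kind of direct-BLAS timing I have in mind looks something like this (just a sketch, assuming BenchmarkTools.jl is installed; the size is an arbitrary placeholder):

```julia
using LinearAlgebra, BenchmarkTools

n = 2000                         # arbitrary size; the interesting regime varies
A = rand(n, n); B = rand(n, n); C = zeros(n, n)

# Call the BLAS gemm routine directly, bypassing higher-level dispatch:
# computes C = 1.0 * A * B + 0.0 * C in place
@btime BLAS.gemm!('N', 'N', 1.0, $A, $B, 0.0, $C);
```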
Both Julia and NumPy call BLAS, so as long as they use the same BLAS library, you should not see any difference. NumPy might use MKL by default, whereas Julia defaults to OpenBLAS (MKL.jl is available, though).
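To rule out a library mismatch, you can check which BLAS Julia actually loaded:

```julia
using LinearAlgebra
BLAS.get_config()   # lists the BLAS/LAPACK libraries in use (Julia 1.7+)
```

On the NumPy side, `numpy.show_config()` prints the equivalent build information.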
Thanks. I think NumPy uses some other BLAS by default, since there are separate instructions for incorporating MKL.
In my limited experience, NumPy linked against MKL, using opt-einsum or einsum with optimize=True, is still a factor of 3~10 slower than MKL BLAS called from C++. Maybe that is because I compare small cases, with contraction times around 1e-5~1e-2 s, where fixed per-call overhead can matter.
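For reference, a sketch of the small-size comparison on the Julia side (assuming TensorOperations.jl and BenchmarkTools.jl; the helper name `contract!` and the size 50 are just examples from the regime I care about):

```julia
using LinearAlgebra, TensorOperations, BenchmarkTools

# In-place contraction via TensorOperations' @tensor macro; a pairwise
# contraction like this is lowered to a gemm call internally.
contract!(C, A, B) = @tensor C[i, j] = A[i, k] * B[k, j]

n = 50                           # small enough that per-call overhead shows up
A = rand(n, n); B = rand(n, n); C = zeros(n, n)

@btime mul!($C, $A, $B);         # plain BLAS dgemm for reference
@btime contract!($C, $A, $B);
```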