Tensor contraction efficiency

I saw a benchmark comparing NumPy with BLAS called from C++ for matrix-matrix multiplication (a special case of tensor contraction):
https://stackoverflow.com/questions/7596612/benchmarking-python-vs-c-using-blas-and-numpy
It seems NumPy can be slower than C++ by 1-2 orders of magnitude.

I am wondering how einsum/TensorOperations.jl perform in Julia by comparison (see the sketch at the end of this post).

From the Julia Micro-Benchmarks,
it seems matrix multiplication performance in C and Python is similar, which is quite different from the StackOverflow link above. Hence, I would like to see a more comprehensive benchmark, especially one comparing BLAS used directly from C/C++/Fortran against Julia :slight_smile:
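For concreteness, the kind of contraction I have in mind looks like this in TensorOperations.jl (a minimal sketch; the index pattern and sizes are placeholders, not the exact cases from the StackOverflow post):

```julia
using TensorOperations

# Placeholder sizes; the real cases of interest vary.
A = rand(50, 50, 50)
B = rand(50, 50, 50)

# Contract over the shared indices k and l:
# C[i,j] = sum_{k,l} A[i,k,l] * B[l,k,j]
@tensor C[i, j] := A[i, k, l] * B[l, k, j]
```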

Both Julia and NumPy call BLAS, so as long as they use the same BLAS library, you should not see any difference. NumPy might use MKL by default, whereas Julia defaults to OpenBLAS (MKL.jl is available, though).
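For example, you can check which BLAS Julia is linked against and, if you want, switch to MKL — a minimal sketch; the exact call depends on the Julia version:

```julia
using LinearAlgebra

# On Julia <= 1.6 this reports the vendor (:openblas64 by default);
# on Julia >= 1.7 use BLAS.get_config() instead.
BLAS.vendor()

# Optionally use MKL instead of OpenBLAS.
# On Julia >= 1.7, `using MKL` swaps the BLAS backend at runtime;
# on older versions MKL.jl rebuilt the system image to link MKL.
# using MKL

# Thread counts also matter when comparing against BLAS called from C++:
BLAS.set_num_threads(4)
```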

1 Like

For a non-BLAS-based but highly competitive library, see Tullio.jl
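A minimal sketch of what that looks like (loading LoopVectorization as well, which Tullio uses to speed up the generated loops when it is available):

```julia
using Tullio, LoopVectorization

A = rand(100, 100)
B = rand(100, 100)

# Einsum-style contraction: C[i,j] = sum_k A[i,k] * B[k,j]
@tullio C[i, j] := A[i, k] * B[k, j]
```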

2 Likes

The README of https://github.com/chriselrod/PaddedMatrices.jl has some benchmarks of different BLAS-like libraries for matrix multiplication.

2 Likes

Thanks. I think NumPy uses a different BLAS by default, since there are instructions for building it against MKL:
https://software.intel.com/content/www/us/en/develop/articles/build-numpy-with-mkl-and-icc.html

In my limited experience, NumPy with MKL and opt-einsum with optimize=True is still a factor of 3-10 slower compared with MKL BLAS called from C++. That may be because the cases I compare are small, with contraction times of roughly 1e-5 to 1e-2 s.
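A sketch of how such a small-size comparison could be set up on the Julia side with BenchmarkTools (the size below is a placeholder, not the exact cases timed above):

```julia
using BenchmarkTools, LinearAlgebra, TensorOperations

# Placeholder size; the cases of interest have contraction times
# in roughly the 1e-5 to 1e-2 s range.
n = 200
A = rand(n, n)
B = rand(n, n)
C = similar(A)

# Plain gemm through whichever BLAS Julia is linked against.
f_blas!(C, A, B) = mul!(C, A, B)

# The same contraction written einsum-style via TensorOperations.jl.
f_tensor!(C, A, B) = @tensor C[i, j] = A[i, k] * B[k, j]

@btime f_blas!($C, $A, $B)
@btime f_tensor!($C, $A, $B)
```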