Tensor contraction efficiency

I saw a benchmark comparing NumPy with BLAS called from C++ for matrix-matrix multiplication (a special case of tensor contraction).

It seems NumPy can be slower than C++ by 1-2 orders of magnitude.

I am wondering how einsum/TensorOperations in Julia compare.

According to the Julia Micro-Benchmarks, matrix multiplication performance in C and Python is similar, which is quite different from the StackOverflow benchmark above. Hence, I would like to see a more comprehensive benchmark, especially one comparing against using BLAS directly from C/C++/Fortran and Julia :)
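To make the "special case of tensor contraction" remark concrete, here is a minimal NumPy sketch. The shapes and variable names are my own illustration and not taken from any of the benchmarks mentioned. It checks that a matrix product is the contraction `ik,kj->ij`, and that a contraction over two shared indices can be rewritten as a single matrix product after reshaping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Matrix-matrix multiplication is the contraction C_ij = sum_k A_ik B_kj.
A = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 3))
C_matmul = A @ B
C_einsum = np.einsum("ik,kj->ij", A, B)

# A contraction over two shared indices (k, l):
# D_ij = sum_{k,l} T_ikl U_klj
T = rng.standard_normal((4, 5, 6))
U = rng.standard_normal((5, 6, 3))
D_einsum = np.einsum("ikl,klj->ij", T, U)
D_tensordot = np.tensordot(T, U, axes=([1, 2], [0, 1]))

# The same contraction as one matrix product: merge (k, l) into a
# single index of length 5 * 6 = 30 on both operands.
D_matmul = T.reshape(4, 30) @ U.reshape(30, 3)

print(np.allclose(C_matmul, C_einsum))
print(np.allclose(D_einsum, D_tensordot), np.allclose(D_einsum, D_matmul))
```

The reshape trick is one reason contraction performance is often discussed in terms of matrix multiplication performance.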

Both Julia and NumPy call BLAS, so as long as they use the same BLAS library, you should not see any difference. NumPy might use MKL by default, whereas Julia defaults to OpenBLAS (MKL.jl is available though).
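If you want to check this on your own machine, a rough sketch along these lines might help. `np.show_config()` prints the build configuration of your NumPy install, including BLAS/LAPACK information. The helper `time_matmul`, the matrix size, and the repeat count are my own choices for illustration.

```python
import time
import numpy as np


def time_matmul(n=256, repeats=5, seed=0):
    """Return the best wall-clock time (seconds) of an n x n matrix product."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    a @ b  # warm-up call so one-time setup costs are not measured
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return best


if __name__ == "__main__":
    np.show_config()  # prints build info, including the BLAS/LAPACK setup
    print(f"best time for 256x256 matmul: {time_matmul():.3e} s")
```

Running the same size from Julia or C++ against the same BLAS library, with the same number of threads, would be the fair comparison.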


For a non-BLAS based but highly competitive library, see Tullio.jl


The README of chriselrod/PaddedMatrices.jl (a library that provides arrays with columns padded to be a multiple of the SIMD-vector width) has some benchmarks of different BLAS-like libraries for matrix multiplication.


Thanks. I think NumPy uses a different BLAS by default, since there are separate ways to build or install it with MKL.

In my limited experience, NumPy-MKL/opt-einsum with optimize=True is still a factor of 3 to 10 slower than calling MKL BLAS from C++. Maybe that is because I only compared small cases, with contraction times of roughly 1e-5 to 1e-2 s.
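For contractions this small, one way to see where the time goes is to time a few variants of the same contraction. The sketch below uses only NumPy (not opt-einsum). The function name `compare_small_contraction`, the expression, and the sizes are assumptions for illustration. It compares `np.einsum` with `optimize=True`, `np.einsum` with a path precomputed by `np.einsum_path`, and a plain matmul chain.

```python
import timeit
import numpy as np


def compare_small_contraction(n=16, number=200, seed=0):
    """Time three ways of evaluating the contraction ij,jk,kl->il."""
    rng = np.random.default_rng(seed)
    a, b, c = (rng.standard_normal((n, n)) for _ in range(3))
    expr = "ij,jk,kl->il"

    # Compute the contraction order once so it can be reused on every call.
    path, _ = np.einsum_path(expr, a, b, c, optimize="optimal")

    candidates = {
        "einsum optimize=True": lambda: np.einsum(expr, a, b, c, optimize=True),
        "einsum precomputed path": lambda: np.einsum(expr, a, b, c, optimize=path),
        "matmul chain": lambda: a @ b @ c,
    }

    reference = a @ b @ c
    timings = {}
    for name, fn in candidates.items():
        # Make sure every variant computes the same result before timing it.
        assert np.allclose(fn(), reference)
        timings[name] = timeit.timeit(fn, number=number) / number
    return timings


if __name__ == "__main__":
    for name, seconds in compare_small_contraction().items():
        print(f"{name:>26s}: {seconds:.3e} s per call")
```

If the variants differ a lot at small `n` but converge at large `n`, that would point to per-call overhead rather than the BLAS library itself. The numbers will depend on the machine and the NumPy build.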