using TensorOperations
using LinearAlgebra
using BenchmarkTools
using Tullio
A = rand(1000, 1000)
B = rand(1000, 1000)
@btime @tensor C[i,k] := A[i,j]*B[j,k]
@btime @tullio C[i,k] := A[i,j]*B[j,k]

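I set the number of Julia threads with export JULIA_NUM_THREADS=1 and then export JULIA_NUM_THREADS=6. With 1 thread and then with 6 threads (first line @tensor, second line @tullio), I got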

22.056 ms (3 allocations: 7.63 MiB)
731.924 ms (2 allocations: 7.63 MiB)

and

22.146 ms (3 allocations: 7.63 MiB)
129.130 ms (77 allocations: 7.63 MiB)

@tensor shows no speedup from the extra threads. @tullio does speed up, but it is still much slower than @tensor. Is there any way to speed up @tensor, or is there another package I should use?

using TensorOperations
using LinearAlgebra
using LoopVectorization
using BenchmarkTools
function mat_mul_1(A, B)
    @tensor C[i,k] := A[i,j]*B[j,k]
    return C
end
function mat_mul_2(A, B)
    C = A * B
    return C
end
@btime mat_mul_1(A, B) setup=(A = rand(1000, 1000); B = rand(1000, 1000))
@btime mat_mul_2(A, B) setup=(A = rand(1000, 1000); B = rand(1000, 1000))

returning

9.764 ms (2 allocations: 7.63 MiB)
9.774 ms (2 allocations: 7.63 MiB)

Looks to me like @tensor is doing the right thing™ already.

The operation could already be multithreaded if it is dispatched to BLAS (see BLAS.set_num_threads), so I wouldn't expect to get more performance here.
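To check whether BLAS threading is what matters on a given machine, one could time a plain matrix product under different BLAS thread counts. The following is only a minimal sketch: time_matmul is a hypothetical helper, not something from TensorOperations, and it assumes Julia 1.6 or later so that BLAS.get_num_threads is available.

```julia
using LinearAlgebra

# Hypothetical helper (not from the thread): time an n×n matrix product
# with the BLAS pool set to `nblas` threads, then restore the old setting.
function time_matmul(nblas::Integer; n::Integer = 1000, reps::Integer = 5)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(nblas)
    try
        A = rand(n, n)
        B = rand(n, n)
        A * B  # warm-up call so compilation is not timed
        return minimum(@elapsed(A * B) for _ in 1:reps)
    finally
        BLAS.set_num_threads(old)  # always restore the previous setting
    end
end

# Julia threads (JULIA_NUM_THREADS) and BLAS threads are configured separately.
println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", BLAS.get_num_threads())
for nblas in (1, Threads.nthreads())
    println(nblas, " BLAS thread(s): ", time_matmul(nblas), " s")
end
```

If the timings differ between BLAS thread counts but not between JULIA_NUM_THREADS settings, that would be consistent with the product being handled by BLAS rather than by Julia threads.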

Try loading LoopVectorization alongside Tullio (using Tullio, LoopVectorization). This should speed up @tullio.

julia> using Tullio, TensorOperations, LinearAlgebra, BenchmarkTools
julia> BLAS.set_num_threads(@show(Threads.nthreads()));
Threads.nthreads() = 8
julia> A = rand(1000,1000); B = rand(1000,1000); C = similar(A);
julia> @btime @tensor $C[i,k] = $A[i,j]*$B[j,k];
2.575 ms (0 allocations: 0 bytes)
julia> @btime @tullio $C[i,k] = $A[i,j]*$B[j,k];
92.953 ms (114 allocations: 5.94 KiB)
julia> using LoopVectorization
julia> @btime @tullio $C[i,k] = $A[i,j]*$B[j,k];
3.956 ms (115 allocations: 5.97 KiB)

@tensor still wins for me, but @tullio at least improves a lot once LoopVectorization is loaded. @tullio is pure Julia, which is why it relies on Julia threads rather than BLAS threads, and why it needs LoopVectorization for good performance on the CPU.
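To make the "pure Julia" point concrete, here is a rough sketch of the kind of loop nest that C[i,k] = A[i,j]*B[j,k] describes, parallelized over Julia threads. This is an illustration only, not the code Tullio actually generates; naive_threaded_matmul! is a hypothetical name, and the loop has none of the cache blocking or SIMD tuning that BLAS and LoopVectorization provide.

```julia
using LinearAlgebra

# Hypothetical illustration: C[i,k] = sum over j of A[i,j] * B[j,k],
# with the columns of C split across Julia threads.
function naive_threaded_matmul!(C, A, B)
    @assert size(A, 2) == size(B, 1)
    @assert size(C) == (size(A, 1), size(B, 2))
    Threads.@threads for k in axes(B, 2)
        for i in axes(A, 1)
            acc = zero(eltype(C))
            for j in axes(A, 2)
                acc += A[i, j] * B[j, k]
            end
            C[i, k] = acc  # each thread writes to its own columns of C
        end
    end
    return C
end

A = rand(200, 300); B = rand(300, 100); C = zeros(200, 100)
naive_threaded_matmul!(C, A, B)
println(C ≈ A * B)  # compare against the BLAS-backed product
```

A loop like this only uses more than one core when Julia itself is started with several threads, which matches the dependence of @tullio on JULIA_NUM_THREADS seen in the timings above.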