Parallelization efficiency for @tensor and @tullio

For the input

using TensorOperations
using LinearAlgebra
using BenchmarkTools
using Tullio

A = rand(1000, 1000)
B = rand(1000, 1000)

@btime @tensor C[i,k] := A[i,j]*B[j,k]
@btime @tullio  C[i,k] := A[i,j]*B[j,k]

I used export JULIA_NUM_THREADS=1 and export JULIA_NUM_THREADS=6 to control the number of Julia threads. With one thread I got

  22.056 ms (3 allocations: 7.63 MiB)
  731.924 ms (2 allocations: 7.63 MiB)

and with six threads

  22.146 ms (3 allocations: 7.63 MiB)
  129.130 ms (77 allocations: 7.63 MiB)

There is no acceleration for @tensor. @tullio does get faster with more threads, but it is still much slower than @tensor. Is there any way to speed up @tensor, or should I use a different package?

Just checked

using TensorOperations
using LinearAlgebra
using LoopVectorization
using BenchmarkTools

function mat_mul_1(A, B)
    @tensor C[i,k] := A[i,j]*B[j,k]
    return C
end
function mat_mul_2(A, B)
    C = A * B
    return C
end   

@btime mat_mul_1(A, B) setup=(A = rand(1000, 1000); B = rand(1000, 1000))
@btime mat_mul_2(A, B) setup=(A = rand(1000, 1000); B = rand(1000, 1000))

returning

  9.764 ms (2 allocations: 7.63 MiB)
  9.774 ms (2 allocations: 7.63 MiB)

Looks to me like @tensor is doing the right thing™ already.


Thanks. But how can I use more cores to speed up @tensor?

The operation may be multithreaded already if it is dispatched to BLAS; see BLAS.set_num_threads. So I wouldn’t expect to get more performance here.
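In code, that looks like the following minimal sketch (assuming a Julia version where `BLAS.get_num_threads` is available, i.e. ≥ 1.6):

```julia
using LinearAlgebra

# BLAS threading is a runtime setting, independent of JULIA_NUM_THREADS.
BLAS.set_num_threads(4)

# Query the current setting to confirm it took effect.
@show BLAS.get_num_threads()

A = rand(1000, 1000); B = rand(1000, 1000)
C = A * B   # this gemm call now runs on up to 4 BLAS threads
```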


Thank you so much. After adding

BLAS.set_num_threads(2)

it indeed gets faster!

  12.883 ms (3 allocations: 7.63 MiB)
  836.022 ms (2 allocations: 7.63 MiB)

BLAS.set_num_threads(4)

  6.593 ms (3 allocations: 7.63 MiB)
  792.141 ms (2 allocations: 7.63 MiB)

It seems @tensor's threading is controlled by BLAS.set_num_threads, while @tullio uses Julia's own threads set via export JULIA_NUM_THREADS=.
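The two knobs can be checked and aligned in one session; a minimal sketch (the `Threads.nthreads()` value is fixed at startup by `JULIA_NUM_THREADS` or `--threads`):

```julia
using LinearAlgebra

# Julia's own threads (used by @tullio) are fixed at startup, e.g.
#   export JULIA_NUM_THREADS=6   or   julia --threads 6
@show Threads.nthreads()

# BLAS threads (used by @tensor's matrix-multiply path) can be changed
# at any point during the session:
BLAS.set_num_threads(Threads.nthreads())
```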

Try using Tullio, LoopVectorization. This should speed it up.

julia> using Tullio, TensorOperations, LinearAlgebra, BenchmarkTools

julia> BLAS.set_num_threads(@show(Threads.nthreads()));
Threads.nthreads() = 8

julia> A = rand(1000,1000); B = rand(1000,1000); C = similar(A);

julia> @btime @tensor $C[i,k] = $A[i,j]*$B[j,k];
  2.575 ms (0 allocations: 0 bytes)

julia> @btime @tullio $C[i,k] = $A[i,j]*$B[j,k];
  92.953 ms (114 allocations: 5.94 KiB)

julia> using LoopVectorization

julia> @btime @tullio $C[i,k] = $A[i,j]*$B[j,k];
  3.956 ms (115 allocations: 5.97 KiB)

@tensor still wins for me, but @tullio at least improves a lot.
@tullio is pure Julia, which is why it relies on Julia threads and why it needs LoopVectorization for good performance on the CPU.
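As a minimal sketch of that dependency: LoopVectorization has to be loaded before the `@tullio` expression is compiled, so that the macro can emit vectorized kernels, and the result can be sanity-checked against plain BLAS:

```julia
using Tullio, LoopVectorization, LinearAlgebra

# Load LoopVectorization *before* writing the @tullio expression so the
# macro generates vectorized loops; threading uses Julia's own threads.
A = rand(500, 500); B = rand(500, 500)
@tullio C[i,k] := A[i,j] * B[j,k]

# Sanity check against the BLAS-backed matrix multiplication.
@assert C ≈ A * B
```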
