TensorKit efficiency

I tried to run the contraction C[a,c,d] = A[a,b]*B[b,c,d], where the repeated index b is summed over, using TensorKit. With the following code, TensorKit seems to be even slower than nested loops. Does TensorKit use BLAS, or is there a setting I am missing? Thank you very much. Here is my code:

#import Pkg; Pkg.add("TensorKit")
#import Pkg; Pkg.add("TensorOperations")
using TensorKit
#using TensorOperations
# A[a,b]*B[b,c,d]  = C[a,c,d]
na = nb = nc = nd = 12
A = rand(na, nb)
B = rand(nb, nc, nd)
C = zeros(na, nc, nd)
#println(A)

s = 0.0

@time begin
  for d in 1:nd
    for c in 1:nc
      for a in 1:na
        for b in 1:nb
          global C
          C[a,c,d] += A[a,b] * B[b,c,d]
        end  
      end 
    end 
  end   
end

print(C[1,2,3])
C = zeros(na, nc, nd)

@time begin
  @tensor  C[a,c,d] := A[a,b] * B[b,c,d]
end
print(C[1,2,3])

I got 0.040242 seconds for the nested loops and 12.562485 seconds for TensorKit.

In Julia, nested loops are generally fast. Note that timing code that accesses global variables is a bad idea if you want meaningful results.

Thanks. In Python, if I use C2 = np.einsum('ab,bcd->acd', A, B), I get ~2e-4 seconds. I expected Julia, which loads BLAS, to perform at a similar level.

If I wrote

function tensorkit_contraction(A,B,C)
  @time begin
    @tensor  C[a,c,d] := A[a,b] * B[b,c,d]
  end
 # print(C[1,2,3])
end 

@time tensorkit_contraction(A,B,C)

Indeed, I see very different results:

 0.000187 seconds
 7.285494 seconds

Does the second timing include the time for loading the TensorKit package? If loading takes that long, the advantage compared with Python's einsum would not be that large, even though TensorKit supports symmetries.

The benchmarks you wrote aren’t measuring what you think they’re measuring.

Here is a benchmark that measures the actual runtime performance (it also includes a Tullio.jl + LoopVectorization.jl benchmark for extra context):

using TensorKit
using Tullio, LoopVectorization # Tullio is pretty fast, but if LoopVectorization.jl is also loaded, it'll be even faster.

using BenchmarkTools # this package is essential for producing proper benchmarks

function contract!(C::Array{Float64, 3}, A::Array{Float64, 2}, B::Array{Float64, 3})
    na, nc, nd = size(C)
    nb = size(A, 2)
    @assert na == size(A, 1) && nc == size(B, 2) && nd == size(B, 3)
    for d in 1:nd
        for c in 1:nc
            for a in 1:na
                for b in 1:nb
                    C[a,c,d] += A[a,b] * B[b,c,d]
                end  
            end 
        end 
    end
    C
end

julia> let
           na = nb = nc = nd = 12
           A = rand(na, nb)
           B = rand(nb, nc, nd)
           C = zeros(na, nc, nd)
           @btime contract!($C, $A, $B)
           @btime @tensor $C[a,c,d] = $A[a,b] * $B[b,c,d]
           @btime @tullio $C[a,c,d] = $A[a,b] * $B[b,c,d]
       end;
  19.760 μs (0 allocations: 0 bytes)
  3.811 μs (1 allocation: 16 bytes)
  2.273 μs (0 allocations: 0 bytes)

Writing manual loops is fast, but for dense linear algebra it can easily leave up to an order of magnitude of performance on the table compared to smarter algorithms, like those handwritten in BLAS libraries or generated by Tullio.jl / LoopVectorization.jl.
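Incidentally, this particular contraction is a single matrix multiplication in disguise: flattening B over (c, d) turns C[a,c,d] = Σ_b A[a,b] * B[b,c,d] into an ordinary GEMM, which dispatches to the BLAS that ships with Julia. A minimal sketch (the name contract_blas is mine, not from any package):

using LinearAlgebra

function contract_blas(A::Matrix{Float64}, B::Array{Float64,3})
    nb, nc, nd = size(B)
    # A * reshape(B, nb, nc*nd) is a plain BLAS gemm; reshape the result back.
    return reshape(A * reshape(B, nb, nc * nd), size(A, 1), nc, nd)
end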


More generally, though, I urge you to learn how to properly use and benchmark Julia. The Julia manual is a very good source of information about using the language, and its Performance Tips section is especially important if you want code that lives up to Julia’s performance claims.

You will notice that the very first heading says in bold letters “Avoid Global Variables”, and under that heading it further mentions:

Any code that is performance critical or being benchmarked should be inside a function.
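A minimal illustration of that advice (a sketch; exact numbers will vary by machine):

s = 0.0
@time for i in 1:10^7     # top-level loop: `s` is an untyped global,
    global s += i         # so every iteration boxes it and re-dispatches
end

function sumto(n)         # the identical loop inside a function:
    s = 0.0               # `s` is now a typed local
    for i in 1:n
        s += i
    end
    return s
end
@time sumto(10^7)         # typically orders of magnitude faster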

Finally, BenchmarkTools.jl (JuliaCI/BenchmarkTools.jl on GitHub) is a very important package for producing accurate and reliable benchmarks.


Thanks for your suggestion and sample code.

Indeed, the @time tensorkit_contraction(A,B,C) call is >99% compilation time. If I rerun the function a second time in the same input file, the time gets much smaller. It seems @time includes this compilation overhead (see the “@time vs @btime” discussion).
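Calling the function twice in a row makes this visible (a minimal sketch, reusing the function I defined above):

@time tensorkit_contraction(A,B,C)   # first call: dominated by JIT compilation
@time tensorkit_contraction(A,B,C)   # second call: runtime only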

If I may reuse my naive code,

function tensorkit_contraction(A,B,C)
  @time begin
    @tensor  C[a,c,d] := A[a,b] * B[b,c,d]
  end
 # print(C[1,2,3])
end 

@time tensorkit_contraction(A,B,C)

@btime tensorkit_contraction(A,B,C)

I got ERROR: LoadError: LoadError: UndefVarError: @btime not defined. May I know why I get the @btime not defined error message?

BenchmarkTools.jl provides @btime.


Note that you have to do

using Pkg
Pkg.add("BenchmarkTools")

once to install it.
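After installing it once, load it in every session where you want to benchmark:

using BenchmarkTools   # brings @btime and @benchmark into scope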


Thanks. It works! Though, since @btime runs the function many times, the inner @time prints a lot of timings; I need to use

function tensorkit_contraction_2(A,B,C)
    @tensor C[a,c,d] := A[a,b] * B[b,c,d]
    # print(C[1,2,3])
end

@btime tensorkit_contraction_2(A,B,C)

to report a single time.

You shouldn’t put an @time inside the function you’re benchmarking unless you want to benchmark how long it takes to record and print times, which I don’t think is your goal.
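One further refinement, as in my benchmark above: interpolate the global arguments with $ so that @btime doesn’t also measure untyped-global access:

@btime tensorkit_contraction_2($A, $B, $C)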


Sorry for bringing this post back; I have one more question. Is there any parallelization in TensorKit? I ran julia 1.jl -p 2 (the script is saved as 1.jl) but did not get a speedup.

You should do

julia -t 2 1.jl

instead of -p 2; note that Julia’s command-line switches must come before the script name. -p controls multiprocessing (distributed processes), but you probably want multithreading here. TensorKit should support multithreading.
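You can check from inside the script that the flag was picked up (a minimal sketch; BLAS.get_num_threads requires a reasonably recent Julia, and note that BLAS keeps its own thread pool, tuned separately):

# run as: julia -t 2 1.jl
using Base.Threads, LinearAlgebra
@show Threads.nthreads()        # 2 if the -t switch was picked up
@show BLAS.get_num_threads()    # BLAS threads are controlled independently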
