TensorKit efficiency

I tried to run the contraction C[a,c,d] = A[a,b]*B[b,c,d], where the repeated index b is summed over, using TensorKit. With the following code, TensorKit seems to be even slower than nested loops. Does TensorKit use BLAS, or is there a setting I am missing? Thank you very much. Here is my code:

#import Pkg; Pkg.add("TensorKit")
#import Pkg; Pkg.add("TensorOperations")
using TensorKit
#using TensorOperations
# A[a,b]*B[b,c,d]  = C[a,c,d]
na = nb = nc = nd = 12
A = rand(na, nb)
B = rand(nb, nc, nd)
C = zeros(na, nc, nd)
#println(A)

s = 0.0

@time begin
  for d in 1:nd
    for c in 1:nc
      for a in 1:na
        for b in 1:nb
          global C
          C[a,c,d] += A[a,b] * B[b,c,d]
        end  
      end 
    end 
  end   
end

print(C[1,2,3])
C = zeros(na, nc, nd)

@time begin
  @tensor  C[a,c,d] := A[a,b] * B[b,c,d]
end
print(C[1,2,3])

I got 0.040242 seconds for the nested loops and 12.562485 seconds for TensorKit.

In Julia, nested loops are generally fast. Note that timing code that accesses global variables is a bad idea if you want meaningful results.

Thanks. In Python, if I use C2 = np.einsum('ab,bcd->acd', A, B), I get ~2e-4 seconds. I expected Julia, which loads BLAS, to perform at a similar level.

If I wrote

function tensorkit_contraction(A,B,C)
  @time begin
    @tensor  C[a,c,d] := A[a,b] * B[b,c,d]
  end
 # print(C[1,2,3])
end 

@time tensorkit_contraction(A,B,C)

Indeed, I see very different results:

 0.000187 seconds
 7.285494 seconds

Does the second timing include the time for loading the TensorKit package? If loading takes that long, the advantage compared with Python's einsum would not be that large, even though TensorKit supports symmetries.

The benchmarks you wrote aren’t measuring what you think they’re measuring.

Here is a benchmark that measures the actual runtime performance (it also includes a Tullio.jl + LoopVectorization.jl benchmark for extra context):

using TensorKit
using Tullio, LoopVectorization # Tullio is pretty fast, but if LoopVectorization.jl is also loaded, it'll be even faster.

using BenchmarkTools # this package is essential for producing proper benchmarks

function contract!(C::Array{Float64, 3}, A::Array{Float64, 2}, B::Array{Float64, 3})
    na, nc, nd = size(C)
    nb = size(A, 2)
    @assert na == size(A, 1) && nc == size(B, 2) && nd == size(B, 3)
    for d in 1:nd
        for c in 1:nc
            for a in 1:na
                for b in 1:nb
                    C[a,c,d] += A[a,b] * B[b,c,d]
                end  
            end 
        end 
    end
    C
end

julia> let
           na = nb = nc = nd = 12
           A = rand(na, nb)
           B = rand(nb, nc, nd)
           C = zeros(na, nc, nd)
           @btime contract!($C, $A, $B)
           @btime @tensor $C[a,c,d] = $A[a,b] * $B[b,c,d]
           @btime @tullio $C[a,c,d] = $A[a,b] * $B[b,c,d]
       end;
  19.760 μs (0 allocations: 0 bytes)
  3.811 μs (1 allocation: 16 bytes)
  2.273 μs (0 allocations: 0 bytes)

Writing manual loops is fast, but for dense linear algebra it can easily leave up to an order of magnitude of performance on the table compared to smarter algorithms, like those handwritten in BLAS libraries or generated by Tullio.jl / LoopVectorization.jl.
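Incidentally, this particular contraction is a single matrix multiplication in disguise: flattening B over (c, d) turns C[a,c,d] = Σ_b A[a,b] * B[b,c,d] into an ordinary GEMM, which dispatches to the BLAS that ships with Julia. A minimal sketch (the name contract_blas is mine, not from any package):

using LinearAlgebra

function contract_blas(A::Matrix{Float64}, B::Array{Float64,3})
    nb, nc, nd = size(B)
    # A * reshape(B, nb, nc*nd) is a plain BLAS gemm; reshape the result back.
    return reshape(A * reshape(B, nb, nc * nd), size(A, 1), nc, nd)
end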


More generally, though, I urge you to learn how to properly use and benchmark Julia. The Julia manual is a very good source of information about using the language, and its Performance Tips section is especially important if you want code that lives up to Julia’s performance claims.

You will notice that the very first heading says in bold letters “Avoid Global Variables”, and under that heading it further mentions:

Any code that is performance critical or being benchmarked should be inside a function.
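A minimal illustration of that advice (a sketch; exact numbers will vary by machine):

s = 0.0
@time for i in 1:10^7     # top-level loop: `s` is an untyped global,
    global s += i         # so every iteration boxes it and re-dispatches
end

function sumto(n)         # the identical loop inside a function:
    s = 0.0               # `s` is now a typed local
    for i in 1:n
        s += i
    end
    return s
end
@time sumto(10^7)         # typically orders of magnitude faster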

Finally, BenchmarkTools.jl (JuliaCI/BenchmarkTools.jl on GitHub) is a very important package for producing accurate and reliable benchmarks.


Thanks for your suggestion and sample code.

Indeed, the @time tensorkit_contraction(A,B,C) call is >99% compilation time. If I rerun the function a second time in the same input file, the time gets much smaller. It seems @time includes this compilation overhead (see the “@time vs @btime” discussion).
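Calling the function twice in a row makes this visible (a minimal sketch, reusing the function I defined above):

@time tensorkit_contraction(A,B,C)   # first call: dominated by JIT compilation
@time tensorkit_contraction(A,B,C)   # second call: runtime only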

If I may reuse my naive code,

function tensorkit_contraction(A,B,C)
  @time begin
    @tensor  C[a,c,d] := A[a,b] * B[b,c,d]
  end
 # print(C[1,2,3])
end 

@time tensorkit_contraction(A,B,C)

@btime tensorkit_contraction(A,B,C)

I got ERROR: LoadError: LoadError: UndefVarError: @btime not defined. May I know why I get the @btime not defined error message?

BenchmarkTools.jl provides @btime.


Note that you have to do

using Pkg
Pkg.add("BenchmarkTools")

once to install it.
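After installing it once, load it in every session where you want to benchmark:

using BenchmarkTools   # brings @btime and @benchmark into scope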


Thanks. It works! Though, since @btime runs the function many times, the inner @time prints a lot of timings; I need to use

function tensorkit_contraction_2(A,B,C)
    @tensor C[a,c,d] := A[a,b] * B[b,c,d]
    # print(C[1,2,3])
end

@btime tensorkit_contraction_2(A,B,C)

to report a single time.

You shouldn’t put an @time inside the function you’re benchmarking unless you want to benchmark how long it takes to record and print times, which I don’t think is your goal.
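One further refinement, as in my benchmark above: interpolate the global arguments with $ so that @btime doesn’t also measure untyped-global access:

@btime tensorkit_contraction_2($A, $B, $C)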


Sorry for bringing this post back; I have one more question. Is there any parallelization in TensorKit? I ran julia 1.jl -p 2 (the script is saved as 1.jl) but did not get a speedup.

You should do

julia -t 2 1.jl

instead of -p 2; note that Julia’s command-line switches must come before the script name. -p controls multiprocessing (distributed processes), but you probably want multithreading here. TensorKit should support multithreading.
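You can check from inside the script that the flag was picked up (a minimal sketch; BLAS.get_num_threads requires a reasonably recent Julia, and note that BLAS keeps its own thread pool, tuned separately):

# run as: julia -t 2 1.jl
using Base.Threads, LinearAlgebra
@show Threads.nthreads()        # 2 if the -t switch was picked up
@show BLAS.get_num_threads()    # BLAS threads are controlled independently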
