How do I implement the fastest tensor contractions

I am writing a program that has a tensor contraction with a 4D symmetric tensor at its core. Since it will eventually run inside a nonlinear problem solver, I need this contraction to evaluate as fast as possible.

My first attempt was to use the TensorOperations.jl package; the implementation looks something like this:

using TensorOperations
using BenchmarkTools

n = 10;

@time A = ones(n, n, n, n);
@time B = rand(n, n);
@time u = rand(n);
@time v = rand(n);

@btime @tensor B[a, b] := A[a, b, c, d] * u[c] * v[d];

This runs pretty fast altogether:

Output
  0.000040 seconds (4 allocations: 78.281 KiB)
  0.000010 seconds (1 allocation: 896 bytes)
  0.000009 seconds (1 allocation: 144 bytes)
  0.000012 seconds (1 allocation: 144 bytes)
  21.500 μs (45 allocations: 90.20 KiB)

My second attempt was to use the Tensorial.jl package, which lets me define a static symmetric tensor to avoid storing duplicate entries. That looks something like this:

using Tensorial
using BenchmarkTools

n = 10;

@time A = Tensor{Tuple{@Symmetry{n,n,n,n}}}(ones(n, n, n, n))
@time u = Tensor{Tuple{n}}(rand(n))
@time v = Tensor{Tuple{n}}(rand(n))

@btime A ⋅ u ⋅ v;

Here the contraction itself is very fast, but constructing the Tensor object is extremely slow:

Output
131.126936 seconds (3.13 M allocations: 6.134 GiB, 0.31% gc time, 100.00% compilation time)
  0.398449 seconds (273.82 k allocations: 18.272 MiB, 2.93% gc time, 99.91% compilation time)
  0.000031 seconds (12 allocations: 720 bytes)
  369.903 ns (2 allocations: 2.20 KiB)

Here is my original MATLAB code, from before I rewrote the project in Julia:

n = 100;

tic;    A = ones(n, n, n, n);   toc
tic;    u = rand(n);            toc
tic;    v = rand(n);            toc

tic;    tensorprod(tensorprod(A, u, 4, 1), v, 3, 1);    toc

This already seems to run slower than the TensorOperations.jl package, so using Julia will be an improvement either way, but it is not the massive advantage I hoped for. (Note that for larger n the MATLAB code is only slightly slower than the TensorOperations.jl version.)

Output
Elapsed time is 0.000372 seconds.
Elapsed time is 0.000252 seconds.
Elapsed time is 0.000069 seconds.
Elapsed time is 0.000782 seconds.

Now, I would really like to keep the very fast contraction time of the Tensorial.jl package, and since I only need to construct this big tensor once and can then save it, that would make it the natural choice. Sadly, I have to do these calculations for n > 100; I tried to construct the big symmetric Tensor object at that size, but even after leaving it running for an hour it had not finished.

Is there a method I am missing to construct this symmetric Tensor object faster, or am I missing another obvious way to speed up this contraction? The Tensors.jl package does not seem to allow 4D tensors with a dimension bigger than 3, and I can’t seem to do tensor operations on static arrays from the StaticArrays.jl package.

Thanks in advance,
Laura

1 Like

Disclaimer: I am not familiar with any of the packages here.

I am not sure about your conclusions from these timings, or maybe I am missing the point. The first two times are completely dominated by compilation time. If you repeat the execution of these lines, they should complete almost instantly.

Also, the timings you obtain with @btime are likely not quite accurate, since you use untyped globals. Try to interpolate the variables with $, e.g.:

@btime $A ⋅ $u ⋅ $v;

It seems to me that Tensorial.jl performs a lot more optimization at compile time to make execution as fast as possible, while TensorOperations.jl optimizes less. The huge compilation time could be worth paying, depending on how long your full program runs. You will need to evaluate that on a more realistic example. But first you should probably get accurate timings in these toy cases :slight_smile:
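
For example, your TensorOperations.jl benchmark with interpolated variables would look something like this (a sketch; I have not run it, since I don’t use the package):

@btime @tensor B[a, b] := $A[a, b, c, d] * $u[c] * $v[d];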

Edit: I quickly checked Tensorial.jl, and indeed it works by putting all the information about the tensor in the type domain, similar to what StaticArrays.jl does. So this won’t work for large objects (> 1000 elements or so), because the compiler will just be overwhelmed. So Tensorial.jl is not the right package for your use case.

1 Like

Thank you for your response.

I quickly checked Tensorial.jl, and indeed it works by putting all the information about the tensor in the type domain, similar to what StaticArrays.jl does. So this won’t work for large objects (> 1000 elements or so), because the compiler will just be overwhelmed. So Tensorial.jl is not the right package for your use case.

The tensor that I want to contract has 100 million elements or more so from what you are saying it would not be advisable to use Tensorial.jl here.

I am not sure about your conclusions from these timings, or maybe I am missing the point. The first two times are completely dominated by compilation time. If you repeat the execution of these lines, they should complete almost instantly.

I am aware that when I repeat the Tensorial.jl script it completes almost instantly; the problem was the first run. Once I have the large tensor object, I can simply write it to a file somewhere and load it when I need it.
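
(Something like this is what I have in mind, using the Serialization standard library; the file name is arbitrary:)

using Serialization

serialize("A_tensor.jls", A)       # write the big tensor to disk once
A = deserialize("A_tensor.jls")    # load it again in a later session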

It seems to me that Tensorial.jl performs a lot more optimization at compile time to make execution as fast as possible, while TensorOperations.jl optimizes less. The huge compilation time could be worth paying, depending on how long your full program runs.

I also thought I could get away with using Tensorial.jl if I just managed to construct the tensor, but right now I have no idea how long it would take to compile. It could be 10 hours or a year.

Also, the timings you obtain with @btime are likely not quite accurate, since you use untyped globals. Try to interpolate the variables with $, e.g.:

So I should probably see about optimising the TensorOperations.jl code then?

Yes, I think so. Or maybe someone else knows of another way of implementing the tensor contractions.

Tensorial.jl won’t work for your use case, that’s for sure, because the compiler can’t handle millions of variables (which is probably how Tensorial.jl tries to generate fast code). However, it is likely that the performance gap between Tensorial.jl and TensorOperations.jl shrinks as the tensors become larger. So don’t be sad about that :slight_smile:

1 Like

Hey

Yes, I think so. Or maybe someone else knows of another way of implementing the tensor contractions.

I will keep the topic open for a little while longer then; maybe there is someone out there with a magic solution.

However, it is likely that the performance gap between Tensorial.jl and TensorOperations.jl shrinks as the tensors become larger.

Let us hope so. It is a bit unfortunate that there is no way to implement symmetric arrays with TensorOperations.jl, though. Of those 100 million elements, only a small fraction are actually unique.
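
(For concreteness, a fully symmetric 4D tensor of dimension n has binomial(n + 3, 4) independent entries, so for n = 100 only about 4.4% are unique; a quick check:)

julia> n = 100;

julia> binomial(n + 3, 4)          # independent entries of a fully symmetric 4D tensor
4421275

julia> binomial(n + 3, 4) / n^4    # fraction of the n^4 stored entries
0.04421275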

Thanks for the help :smile:

You are welcome :slight_smile:

After a quick search I found SymArrays.jl (github). Maybe this can help you make things faster?

I think you can just write straight-line code?

function contract!(B, A, u, v, ::Val{N}) where N
    N == size(B, 1) == size(B, 2) == size(u, 1) == size(v, 1) ==
        size(A, 1) == size(A, 2) == size(A, 3) == size(A, 4) ||
        throw(DimensionMismatch("all dimensions must equal N"))

    @inbounds for i = 1:N
        for j = 1:i
            acc = zero(eltype(B))
            for k = 1:N
                acctmp = zero(eltype(B))
                @simd for ell = 1:N
                    acctmp += A[ell, k, j, i] * v[ell]
                end
                acc += acctmp * u[k]
            end
            B[i, j] = B[j, i] = acc   # symmetry: fill both halves of B
        end
    end
    B
end

with

julia> @btime contract!($B, $A, $u, $v, $(Val(n)));
  1.089 μs (0 allocations: 0 bytes)

I told the compiler to specialize on the size N by passing it as a Val; this makes the innermost loop simpler. The @inbounds should not be necessary: the compiler should know that all accesses are in bounds, since N is known at compile time and all sizes are N, otherwise we would have thrown an exception. I’d chalk that up to a compiler bug.

Specializing loops on this N is not so bad compile-time-wise; specializing datatypes with NTuple on large N takes very superlinear compile time, unfortunately.
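
(To illustrate the Val mechanism in isolation, a minimal sketch; firstn_sum is just a toy name:)

# Each distinct N compiles its own method, in which N is a compile-time constant:
firstn_sum(x, ::Val{N}) where {N} = sum(ntuple(i -> x[i], Val(N)))

firstn_sum(rand(10), Val(4))   # specializes on N = 4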

Properly using symmetry is probably nontrivial. I used symmetry twice: First to contract on the best dimensions, memory layout wise, and second in B[i,j] = B[j, i] = acc.

3 Likes

Thank you very much.

Since I’m still pretty new to Julia I am going to need some time to understand what you wrote there and how it compares to TensorOperations.jl.

I’ll come back to this when I have taken the time to properly understand. :sweat_smile:

Let me explain briefly: the code is a very direct translation of what tensor contraction / Einstein summation / Ricci calculus means.

More interesting is: Why should direct code like this be competitive?

The answer is: in $B_{ji} = A_{\ell k j i} u^k v^\ell$, each entry of A appears in only one summand. So the arithmetic intensity of your computation is abysmal: we need roughly one memory load per multiplication! And there is nothing to do about it; this is fundamental to the kind of contraction you want (a 4-tensor with two 1-tensors into a 2-tensor).

So your computation is a glorified memcpy, and our job is to max out your available memory bandwidth. No fancy tricks like blocking are needed: for n = 100, u and v both stay in L1 cache, and the L2 and L3 caches are completely unused. Conceptually, you are doing a level-2 BLAS operation; it is easy to make this as fast as possible, because there is not a lot of room for optimizations.
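
To make the level-2 BLAS point concrete, the same contraction can be phrased as a single matrix-vector product (a sketch of mine, not benchmarked here; contract_gemv is just an illustrative name):

using LinearAlgebra

# B[a,b] = Σ_{c,d} A[a,b,c,d] * u[c] * v[d] as one gemv:
function contract_gemv(A, u, v)
    n = size(A, 1)
    M = reshape(A, n^2, n^2)   # no copy: (a,b) and (c,d) each fold into one index
    w = kron(v, u)             # w[c + (d-1)*n] == u[c] * v[d] in column-major order
    reshape(M * w, n, n)
end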

This is opposed to e.g. matmul $C_i^j = A_i^k B_k^j$, i.e. the contraction of two 2-tensors into one 2-tensor. Here each entry of A and B is used n times, so it pays to make sure that as many of those n uses as possible come out of caches that are as close to the CPU as possible. This is a level-3 BLAS operation, and there is so much room for optimizations!

Indeed,

julia> n = 100; A = ones(n, n, n, n); B = rand(n, n); u = rand(n); v = rand(n); Ac = similar(A);

julia> @btime contract!($B, $A, $u, $v, $(Val(n)));
  14.572 ms (0 allocations: 0 bytes)
julia> @btime copy!($Ac, $A);
  48.206 ms (0 allocations: 0 bytes)
julia> @btime sum($A)
  30.764 ms (0 allocations: 0 bytes)

We can be faster than sum because we only read half the tensor, using symmetry.
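
(A back-of-envelope check, my arithmetic: both timings correspond to an effective read bandwidth of roughly 26-27 GB/s, which fits the bandwidth-bound picture.)

bw_contract = (100^4 * 8 / 2) / 14.572e-3 / 1e9   # ≈ 27.4 GB/s, reading half of A
bw_sum      = (100^4 * 8)     / 30.764e-3 / 1e9   # ≈ 26.0 GB/s, reading all of A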

8 Likes

My apologies for the late reply; I had to spend some time in the hospital for reasons unrelated to tensor contractions. In the meantime I have understood your code.

Thank you.

3 Likes

I would be curious to hear if anyone has had any tensor-contraction-related hospital visits.

1 Like

A specialized version like the one posted above will probably always be fastest, but using something like Tullio.jl can sometimes be almost as good. Once LoopVectorization.jl is off the table in 1.11, though, who knows.

When I replace @tensor with @tullio (after using Tullio, LoopVectorization), I get performance on par with the contract! function that was posted (I’m on 1.10.2).
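
For reference, the drop-in replacement looks roughly like this (a sketch of what I mean; loading LoopVectorization lets @tullio generate faster loops):

using Tullio, LoopVectorization
using BenchmarkTools

n = 100;
A = ones(n, n, n, n); u = rand(n); v = rand(n);

@btime @tullio B[a, b] := $A[a, b, c, d] * $u[c] * $v[d];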