Improving performance of tensor contractions

Hi all,

I want to build a function that takes several tensors, contracts them, and outputs a tensor as the final result. I was using TensorOperations.jl as a first draft, but I felt I could improve the performance by doing the contractions myself with direct BLAS calls. The function and benchmarks compared with TensorOperations are here:

```julia
using LinearAlgebra, TensorOperations, BenchmarkTools

function BL_prop_right3(L::Array{T, 3}, A::Array{T, 3},
                        W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    m = size(A, 1)
    d = size(A, 2)
    w = size(L, 2)

    # reshape shares memory (no copy); each permutedims allocates a copy.
    A = reshape(A, m, d*m)
    L = reshape(L, m, w*m)
    T1 = BLAS.gemm('T', 'N', L, A)
    T1 = reshape(T1, w, m, d, m)
    T1 = permutedims(T1, (1, 3, 2, 4))
    T1 = reshape(T1, w*d, m^2)
    W = reshape(W, w*d, d*w)
    T2 = BLAS.gemm('T', 'N', T1, W)
    T2 = reshape(T2, m, m, d, w)
    T2 = permutedims(T2, (1, 3, 2, 4))
    T2 = reshape(T2, m*d, m*w)
    B = reshape(B, m*d, m)
    Lnew = BLAS.gemm('T', 'N', T2, B)
    Lnew = reshape(Lnew, m, w, m)
    return Lnew
end

function tensorops(L::Array{T, 3}, A::Array{T, 3},
                   W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    @tensor Lnew[r1, r2, r3] := (L[l1, l2, l3]*A[l1, s1, r1]
                                 *W[l2, s1, s2, r2]*B[l3, s2, r3])
    return Lnew
end

function tensorops_slices(L::Array{T, 3}, A::Array{T, 3},
                          W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    @tensor begin
        L1[l1, l2, s2, c] := L[l1, l2, l3]*B[l3, s2, c]
        L2[l1, s1, b, c] := L1[l1, l2, s2, c]*W[l2, s1, s2, b]
        Lnew[a, b, c] := L2[l1, s1, b, c]*A[l1, s1, a]
    end
    return Lnew
end

function run_benchmarks(m)
    w = 100
    d = 2
    L = rand(m, w, m)
    A = rand(m, d, m)
    W = rand(w, d, d, w)
    B = rand(m, d, m)
    # Final tensor's size: (m, w, m).

    L2 = tensorops_slices(L, A, W, B)
    L3 = BL_prop_right3(L, A, W, B)
    L4 = tensorops(L, A, W, B)

    @btime BL_prop_right3($L, $A, $W, $B)
    @btime tensorops_slices($L, $A, $W, $B)
    @btime tensorops($L, $A, $W, $B)
    return
end

run_benchmarks(256)
```
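
As a sanity check, all three functions agree on small random tensors:

```julia
# Hedged check (small dimensions chosen arbitrarily): all three
# implementations should produce the same (m, w, m) tensor.
let m = 8, w = 5, d = 2
    L, A, W, B = rand(m, w, m), rand(m, d, m), rand(w, d, d, w), rand(m, d, m)
    @assert BL_prop_right3(L, A, W, B) ≈ tensorops(L, A, W, B)
    @assert tensorops_slices(L, A, W, B) ≈ tensorops(L, A, W, B)
end
```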

I get for `m=256`:

```
440.183 ms (38 allocations: 450.00 MiB)  # BL_prop_right3
528.910 ms (122 allocations: 250.02 MiB) # tensorops_slices
523.354 ms (220 allocations: 50.03 MiB)  # tensorops
```
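
A large part of those allocations comes from the `permutedims` copies: `reshape` returns an array sharing the parent's memory, whereas `permutedims` allocates and copies. A minimal sketch of the difference:

```julia
x = rand(4, 6)
y = reshape(x, 2, 12)               # no copy: y shares x's memory
y[1] = 0.0
@assert x[1] == 0.0
p = permutedims(x, (2, 1))          # allocates a fresh 6x4 array
permutedims!(similar(p), x, (2, 1)) # or write into a preallocated buffer
```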

The TensorOperations versions are thus roughly 1.2x slower than the hand-written BLAS one.

Do you think I could make the BLAS function even faster?


Hi, I similarly experimented with tensor contractions. You might find the original thread I started interesting.

Michiel

On my machine the results are much closer:

```
julia> run_benchmarks(256)
191.363 ms (38 allocations: 450.00 MiB)
195.040 ms (122 allocations: 250.02 MiB)
205.940 ms (220 allocations: 50.03 MiB)

julia> run_benchmarks(128)
43.785 ms (38 allocations: 112.50 MiB)
46.633 ms (122 allocations: 62.52 MiB)
46.100 ms (220 allocations: 12.53 MiB)

```

Note that `@tensor` does a certain amount of multi-threading in Julia, which could perhaps be why you see differences. It also calls BLAS, which in my case is, I think, the built-in one. There is also a variant `@tensoropt` which takes information about which axes are long/short; docs here.
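
For concreteness, here is a hedged sketch of the cost-hint form of `@tensoropt` (sizes taken from `run_benchmarks` with `m = 256`; the `index => cost` syntax follows the TensorOperations docs):

```julia
# Tell the macro that s1, s2 are tiny (d = 2) and the rest are large,
# so it can pick a cheap pairwise contraction order at parse time.
@tensoropt (s1=>2, s2=>2, l2=>100, r2=>100, l1=>256, l3=>256, r1=>256, r3=>256) Lnew[r1, r2, r3] :=
    L[l1, l2, l3]*A[l1, s1, r1]*W[l2, s1, s2, r2]*B[l3, s2, r3]
```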

Thank you both for your help! I tried on my machine with 12 BLAS threads, and my results are now comparable to @mcabbott's, although I still get the BLAS version about 1.5-2x faster than the other ones.

After trying with `@views` and `@strided` in `permutedims`, I realized that the only way I could improve the timings was to preallocate aggressively:

```julia
using LinearAlgebra, TensorOperations, BenchmarkTools, Strided

function BL_prop_right3(L::Array{T, 3}, A::Array{T, 3},
                        W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    m = size(A, 1)
    d = size(A, 2)
    w = size(L, 2)

    A = reshape(A, m, d*m)
    L = reshape(L, m, w*m)
    T1 = BLAS.gemm('T', 'N', L, A)
    T1 = reshape(T1, w, m, d, m)
    T1 = permutedims(T1, (1, 3, 2, 4))
    T1 = reshape(T1, w*d, m^2)
    W = reshape(W, w*d, d*w)
    T2 = BLAS.gemm('T', 'N', T1, W)
    T2 = reshape(T2, m, m, d, w)
    T2 = permutedims(T2, (1, 3, 2, 4))
    T2 = reshape(T2, m*d, m*w)
    B = reshape(B, m*d, m)
    Lnew = BLAS.gemm('T', 'N', T2, B)
    Lnew = reshape(Lnew, m, w, m)
    return Lnew
end

"""
Use the preallocated arrays `T1`, `T2`, `T1p`, and `T2p` to avoid allocating
in `permutedims!` and `gemm!`. Also reuse `L` as output of the last `gemm!`.
"""
function tweak!(L::Array{T, 3}, A::Array{T, 3},
                W::Array{T, 4}, B::Array{T, 3},
                T1::Array{T, 2}, T2::Array{T, 2},
                T1p::Array{T, 4}, T2p::Array{T, 4}) where T<:Number
    m = size(A, 1)
    d = size(A, 2)
    w = size(L, 2)

    A = reshape(A, m, d*m)
    L = reshape(L, m, w*m)
    BLAS.gemm!('T', 'N', 1., L, A, 0., T1)
    T1 = reshape(T1, w, m, d, m)
    permutedims!(T1p, T1, (1, 3, 2, 4))
    T1p = reshape(T1p, w*d, m^2)
    W = reshape(W, w*d, d*w)
    BLAS.gemm!('T', 'N', 1., T1p, W, 0., T2)
    T2 = reshape(T2, m, m, d, w)
    permutedims!(T2p, T2, (1, 3, 2, 4))
    T2p = reshape(T2p, m*d, m*w)
    B = reshape(B, m*d, m)
    L = reshape(L, m*w, m)
    BLAS.gemm!('T', 'N', 1., T2p, B, 0., L)  # overwrite L with the result
    L = reshape(L, m, w, m)
    return L
end

function tensorops(L::Array{T, 3}, A::Array{T, 3},
                   W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    @tensor Lnew[r1, r2, r3] := (L[l1, l2, l3]*A[l1, s1, r1]
                                 *W[l2, s1, s2, r2]*B[l3, s2, r3])
    return Lnew
end

function run_benchmarks(m)
    w = 100
    d = 2
    L = rand(m, w, m)
    A = rand(m, d, m)
    W = rand(w, d, d, w)
    B = rand(m, d, m)
    # Final tensor's size: (m, w, m).

    T1 = zeros(w*m, d*m)
    T1p = zeros(w, d, m, m)
    T2 = zeros(m^2, d*w)
    T2p = zeros(m, d, m, w)

    L2 = tensorops_slices(L, A, W, B)  # defined in the first post
    L3 = BL_prop_right3(L, A, W, B)
    L4 = tensorops(L, A, W, B)
    L1 = tweak!(L, A, W, B, T1, T2, T1p, T2p)

    L = rand(m, w, m)  # tweak! overwrote L, so draw a fresh one
    L1p = zeros(m, w, d, m)
    L2p = zeros(m, d, w, m)

    @btime tweak!($L, $A, $W, $B, $T1, $T2, $T1p, $T2p)
    @btime BL_prop_right3($L, $A, $W, $B)
    @btime tensorops($L, $A, $W, $B)
    return
end
```

I get

```
> run_benchmarks(128)
12.687 ms (24 allocations: 1.27 KiB)     # tweak!
25.277 ms (38 allocations: 112.50 MiB)   # BL_prop_right3
42.893 ms (210 allocations: 12.52 MiB)   # tensorops

> run_benchmarks(256)
64.850 ms (24 allocations: 1.27 KiB)     # tweak!
132.445 ms (38 allocations: 450.00 MiB)  # BL_prop_right3
158.259 ms (210 allocations: 50.02 MiB)  # tensorops

> run_benchmarks(512)
357.210 ms (24 allocations: 1.27 KiB)    # tweak!
598.253 ms (38 allocations: 1.76 GiB)    # BL_prop_right3
863.981 ms (210 allocations: 200.02 MiB) # tensorops
```
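
One incidental note on the `gemm!` calls in `tweak!`: the literals `1.` and `0.` are `Float64`, so despite the generic `T<:Number` signature the function currently only really works for `Float64` arrays. A hedged fix is to match the scalars to the element type:

```julia
# BLAS.gemm!(tA, tB, alpha, A, B, beta, C) with type-generic scalars,
# so the call also works for, e.g., Float32 arrays:
BLAS.gemm!('T', 'N', one(T), L, A, zero(T), T1)
```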

I could have avoided preallocating one of the two rank-4 arrays if `permutedims!` could work in place, but that doesn't seem possible.

Could you report timings if you modify your `run_benchmarks` function so as to either run the `tensorops` benchmark first, or to reset `L` after `@btime tweak!`? The latter call mutates `L` on every sample, which typically blows up its entries until they are huge, and this can actually affect timings.

Also some info on your Julia version might be useful. You could also experiment with setting `JULIA_NUM_THREADS > 1`.
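
One way to do that reset inside the benchmark itself is BenchmarkTools' `setup` keyword with `evals=1`, so every sample sees a fresh `L` (a sketch, assuming `L0` holds the pristine tensor):

```julia
# Rebuild the work tensor before each sample, so tweak! never operates
# on its own previous (blown-up) output.
@btime tweak!(Lwork, $A, $W, $B, $T1, $T2, $T1p, $T2p) setup=(Lwork = copy($L0)) evals=1
```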

I reset the `L` tensor before each operation and set a high number of threads.

```julia
using LinearAlgebra, TensorOperations, BenchmarkTools, Strided

function BL_prop_right3(L::Array{T, 3}, A::Array{T, 3},
                        W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    m = size(A, 1)
    d = size(A, 2)
    w = size(L, 2)

    A = reshape(A, m, d*m)
    L = reshape(L, m, w*m)
    T1 = BLAS.gemm('T', 'N', L, A)
    T1 = reshape(T1, w, m, d, m)
    T1 = permutedims(T1, (1, 3, 2, 4))
    T1 = reshape(T1, w*d, m^2)
    W = reshape(W, w*d, d*w)
    T2 = BLAS.gemm('T', 'N', T1, W)
    T2 = reshape(T2, m, m, d, w)
    T2 = permutedims(T2, (1, 3, 2, 4))
    T2 = reshape(T2, m*d, m*w)
    B = reshape(B, m*d, m)
    Lnew = BLAS.gemm('T', 'N', T2, B)
    Lnew = reshape(Lnew, m, w, m)
    return Lnew
end

function BL_prop_right3_prealloc!(L::Array{T, 3}, A::Array{T, 3},
                                  W::Array{T, 4}, B::Array{T, 3},
                                  T1::Array{T, 2}, T2::Array{T, 2},
                                  T1p::Array{T, 4}, T2p::Array{T, 4}) where T<:Number
    m = size(A, 1)
    d = size(A, 2)
    w = size(L, 2)

    A = reshape(A, m, d*m)
    L = reshape(L, m, w*m)
    BLAS.gemm!('T', 'N', 1., L, A, 0., T1)
    T1 = reshape(T1, w, m, d, m)
    permutedims!(T1p, T1, (1, 3, 2, 4))
    T1p = reshape(T1p, w*d, m^2)
    W = reshape(W, w*d, d*w)
    BLAS.gemm!('T', 'N', 1., T1p, W, 0., T2)
    T2 = reshape(T2, m, m, d, w)
    permutedims!(T2p, T2, (1, 3, 2, 4))
    T2p = reshape(T2p, m*d, m*w)
    B = reshape(B, m*d, m)
    L = reshape(L, m*w, m)
    BLAS.gemm!('T', 'N', 1., T2p, B, 0., L)
    L = reshape(L, m, w, m)
    return L
end

function tensor!(L::Array{T, 3}, A::Array{T, 3},
                 W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    @tensor L[r1, r2, r3] = (L[l1, l2, l3]*A[l1, s1, r1]
                             *W[l2, s1, s2, r2]*B[l3, s2, r3])
    return L
end

function tensoropt!(L::Array{T, 3}, A::Array{T, 3},
                    W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    @tensoropt L[r1, r2, r3] = (L[l1, l2, l3]*A[l1, s1, r1]
                                *W[l2, s1, s2, r2]*B[l3, s2, r3])
    return L
end

function tensor_slices!(L::Array{T, 3}, A::Array{T, 3},
                        W::Array{T, 4}, B::Array{T, 3},
                        L1::Array{T, 4}, L2::Array{T, 4}) where T<:Number
    @tensor begin
        L1[l1, l2, s2, c] = L[l1, l2, l3]*B[l3, s2, c]
        L2[l1, s1, b, c] = L1[l1, l2, s2, c]*W[l2, s1, s2, b]
        L[a, b, c] = L2[l1, s1, b, c]*A[l1, s1, a]
    end
    return L
end

function run_benchmarks(m)
    w = 100
    d = 2
    L = rand(m, w, m)
    A = rand(m, d, m)
    W = rand(w, d, d, w)
    B = rand(m, d, m)
    # Final tensor's size: (m, w, m).

    # Store original L in a separate copy that won't be modified.
    L0 = deepcopy(L)

    # Arrays for preallocation purposes.
    T1 = zeros(w*m, d*m)
    T1p = zeros(w, d, m, m)
    T2 = zeros(m^2, d*w)
    T2p = zeros(m, d, m, w)
    L1p = zeros(m, w, d, m)
    L2p = zeros(m, d, w, m)

    L = deepcopy(L0)
    @btime BL_prop_right3_prealloc!($L, $A, $W, $B, $T1, $T2, $T1p, $T2p)
    L = deepcopy(L0)
    @btime BL_prop_right3($L, $A, $W, $B)
    L = deepcopy(L0)
    @btime tensor!($L, $A, $W, $B)
    L = deepcopy(L0)
    @btime tensoropt!($L, $A, $W, $B)
    L = deepcopy(L0)
    @btime tensor_slices!($L, $A, $W, $B, $L1p, $L2p)

    return
end

show(versioninfo())

run_benchmarks(128)

run_benchmarks(256)

run_benchmarks(512)
```

The results were

```
12
12
Julia Version 1.1.0
Commit 80516ca202* (2019-01-21 21:24 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
nothing

# m = 128.

12.850 ms (24 allocations: 1.27 KiB) # BL_prop_right3_prealloc!
18.949 ms (38 allocations: 112.50 MiB) # BL_prop_right3
28.091 ms (218 allocations: 26.55 KiB) # tensor!
41.069 ms (218 allocations: 26.55 KiB) # tensoropt!
34.248 ms (114 allocations: 15.98 KiB) # tensor_slices!

# m = 256.

61.937 ms (24 allocations: 1.27 KiB) # BL_prop_right3_prealloc!
135.208 ms (38 allocations: 450.00 MiB) # BL_prop_right3
124.376 ms (218 allocations: 26.55 KiB) # tensor!
110.004 ms (218 allocations: 26.55 KiB) # tensoropt!
111.301 ms (114 allocations: 15.98 KiB) # tensor_slices!

# m = 512.

352.274 ms (24 allocations: 1.27 KiB) # BL_prop_right3_prealloc!
607.429 ms (38 allocations: 1.76 GiB) # BL_prop_right3
743.994 ms (218 allocations: 26.55 KiB) # tensor!
627.976 ms (218 allocations: 26.55 KiB) # tensoropt!
566.558 ms (114 allocations: 15.98 KiB) # tensor_slices!

```

So in fact `@tensoropt` beat my original implementation at `m=256` without preallocation, and it beats `@tensor` when `m` is larger. However, the runtimes are on the order of milliseconds, so there could be some overhead due to the computation of the contraction order, especially when compared with `tensor_slices!`, where the "correct" contraction order is already imposed. Also note that in the functions using TensorOperations.jl I am also using preallocation.

I built Julia from source using Intel MKL.

I don't fully understand the reasoning behind `tensor_slices!`, in that it does the operations in a different order than your handwritten version. The way to write `tensor_slices!` such that it mimics your handwritten version is as follows:

```julia
function tensor_slices!(L::Array{T, 3}, A::Array{T, 3},
                        W::Array{T, 4}, B::Array{T, 3},
                        L1::Array{T, 4}, L2::Array{T, 4},
                        L1p::Array{T, 4}, L2p::Array{T, 4}) where T<:Number
    @tensor begin
        L1[l2, l3, s1, r1] = L[l1, l2, l3]*A[l1, s1, r1]
        L1p[l2, s1, l3, r1] = L1[l2, l3, s1, r1]
        L2[l3, r1, s2, r2] = L1p[l2, s1, l3, r1]*W[l2, s1, s2, r2]
        L2p[l3, s2, r1, r2] = L2[l3, r1, s2, r2]
        L[r1, r2, r3] = L2p[l3, s2, r1, r2]*B[l3, s2, r3]
    end
    return L
end
```

which then takes temporary arrays of size

```julia
L1  = zeros(w, m, d, m)
L1p = zeros(w, d, m, m)
L2  = zeros(m, m, d, w)
L2p = zeros(m, d, m, w)
```

With this, I get, using a single Julia thread

```
julia> versioninfo()
Julia Version 1.1.1
Commit 55e36cc308 (2019-05-16 04:10 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
```

the following (only comparing `BL_prop_right3_prealloc!` and the new `tensor_slices!`)

```
julia> run_benchmarks(128); run_benchmarks(256); run_benchmarks(512)
13.801 ms (24 allocations: 1.27 KiB)
14.198 ms (22 allocations: 1.75 KiB)
67.250 ms (24 allocations: 1.27 KiB)
70.928 ms (22 allocations: 1.75 KiB)
362.352 ms (24 allocations: 1.27 KiB)
371.626 ms (22 allocations: 1.75 KiB)
```

So these timings are comparable, as they should be, even though TensorOperations.jl uses a different algorithm for `permutedims!`. Typically it is more efficient, but in this case, where the permutations have `1` as their first entry, i.e. source and destination share the first dimension, the one from Base seems a bit faster (profiling shows that this is where the difference is, and that the permutations take up most of the time, at least for m=128 and m=256). Now, the advantage of the algorithm in TensorOperations.jl/Strided.jl is that it supports multithreading, though it probably does not work well with too many threads. So using `JULIA_NUM_THREADS=4` instead, I obtain (now enabling all benchmarks):

```
julia> run_benchmarks(128); println("---------"); run_benchmarks(256); println("---------"); run_benchmarks(512)
14.024 ms (24 allocations: 1.27 KiB)
16.084 ms (38 allocations: 112.50 MiB)
12.400 ms (188 allocations: 13.50 KiB)
11.805 ms (188 allocations: 13.50 KiB)
9.677 ms (36 allocations: 3.56 KiB)
---------
68.877 ms (24 allocations: 1.27 KiB)
135.856 ms (38 allocations: 450.00 MiB)
63.799 ms (188 allocations: 13.50 KiB)
58.458 ms (188 allocations: 13.50 KiB)
55.489 ms (36 allocations: 3.56 KiB)
---------
366.578 ms (24 allocations: 1.27 KiB)
615.247 ms (38 allocations: 1.76 GiB)
364.872 ms (188 allocations: 13.50 KiB)
335.230 ms (188 allocations: 13.50 KiB)
323.419 ms (36 allocations: 3.56 KiB)
```

This seems quite reasonable to me.
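
As an aside, Strided.jl's kernel can also be invoked directly from the hand-written pipeline via its `@strided` macro; a hedged sketch (not benchmarked in this thread):

```julia
using Strided

# Route the permutation through Strided.jl's multithreaded, cache-blocked
# kernel instead of Base's permutedims! (assumes T1p and T1 as in tweak!):
@strided permutedims!(T1p, T1, (1, 3, 2, 4))
```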

I did notice you had `JULIA_NUM_THREADS = 12` on a CPU that has only 8 physical cores. I think that is asking for trouble (even though it officially exposes 16 hardware threads, I would not count on hyperthreading).
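
For reference, a minimal sketch of matching both thread pools to the physical core count (8 here); note that `JULIA_NUM_THREADS` must be set before Julia starts:

```julia
# In the shell, before launching Julia:
#   JULIA_NUM_THREADS=8 julia
using LinearAlgebra
BLAS.set_num_threads(8)  # threads used by gemm/gemm!
Threads.nthreads()       # Julia threads used by TensorOperations/Strided
```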


Wow, nice! So indeed the contractions are faster when I use TensorOperations.jl with multithreading than with the pure BLAS implementation plus `Base.permutedims`.

Something I just realized is that the intermediate tensors `L1` and `L2`, or `T1` and `T2`, have equal lengths, so in some functions I can avoid at least two preallocations by reusing them, but that is specific to my problem.
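
Concretely, a sketch of that reuse under the shapes from `run_benchmarks` (my illustration): all four temporaries have length `w*d*m^2`, so two flat buffers suffice, each stage writing into the buffer whose previous contents have already been consumed.

```julia
buf1 = Vector{Float64}(undef, w*d*m^2)
buf2 = Vector{Float64}(undef, w*d*m^2)
T1  = reshape(buf1, w*m, d*m)    # first gemm! output
T1p = reshape(buf2, w, d, m, m)  # first permutedims! destination
T2  = reshape(buf1, m^2, d*w)    # safe: T1 is dead once T1p is written
T2p = reshape(buf2, m, d, m, w)  # safe: T1p is dead once T2 is written
```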

You are right about the order of the contractions in `tensor_slices`; I had inverted it, although the output tensor is the same as in the other functions. The tensor dimensions, however, are symmetric under the contraction order, so this shouldn't affect timings.

So to sum up:

• The only way to improve the performance of my original `BL_prop_right3` using only BLAS and Base is to keep a cache of preallocated intermediate arrays; beyond that, there is nothing else I can do.

• TensorOperations.jl uses multithreading efficiently to speed up the permutations, and thus makes the whole tensor contraction faster.

So I will use your library then.