Improving performance of tensor contractions

Hi all,

I want to build a function that takes several tensors, contracts them, and outputs a tensor as the final result. I was using TensorOperations.jl as a first draft, but I felt I could improve the performance by doing the contractions myself with direct BLAS calls. The function and benchmarks compared with TensorOperations are here:

```julia
using LinearAlgebra, TensorOperations, BenchmarkTools

function BL_prop_right3(L::Array{T, 3}, A::Array{T, 3},
                        W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    m = size(A, 1)
    d = size(A, 2)
    w = size(L, 2)

    # reshape shares memory (no copy); each permutedims allocates a copy.
    A = reshape(A, m, d*m)
    L = reshape(L, m, w*m)
    T1 = BLAS.gemm('T', 'N', L, A)
    T1 = reshape(T1, w, m, d, m)
    T1 = permutedims(T1, (1, 3, 2, 4))
    T1 = reshape(T1, w*d, m^2)
    W = reshape(W, w*d, d*w)
    T2 = BLAS.gemm('T', 'N', T1, W)
    T2 = reshape(T2, m, m, d, w)
    T2 = permutedims(T2, (1, 3, 2, 4))
    T2 = reshape(T2, m*d, m*w)
    B = reshape(B, m*d, m)
    Lnew = BLAS.gemm('T', 'N', T2, B)
    Lnew = reshape(Lnew, m, w, m)
    return Lnew
end

function tensorops(L::Array{T, 3}, A::Array{T, 3},
                   W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    @tensor Lnew[r1, r2, r3] := (L[l1, l2, l3]*A[l1, s1, r1]
                                 *W[l2, s1, s2, r2]*B[l3, s2, r3])
    return Lnew
end

function tensorops_slices(L::Array{T, 3}, A::Array{T, 3},
                          W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    @tensor begin
        L1[l1, l2, s2, c] := L[l1, l2, l3]*B[l3, s2, c]
        L2[l1, s1, b, c] := L1[l1, l2, s2, c]*W[l2, s1, s2, b]
        Lnew[a, b, c] := L2[l1, s1, b, c]*A[l1, s1, a]
    end
    return Lnew
end

function run_benchmarks(m)
    w = 100
    d = 2
    L = rand(m, w, m)
    A = rand(m, d, m)
    W = rand(w, d, d, w)
    B = rand(m, d, m)
    # Final tensor's size: (m, w, m).

    L2 = tensorops_slices(L, A, W, B)
    L3 = BL_prop_right3(L, A, W, B)
    L4 = tensorops(L, A, W, B)

    @btime BL_prop_right3($L, $A, $W, $B)
    @btime tensorops_slices($L, $A, $W, $B)
    @btime tensorops($L, $A, $W, $B)
    return
end

run_benchmarks(256)
```
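
As a sanity check, all three functions agree on small random tensors:

```julia
# Hedged check (small dimensions chosen arbitrarily): all three
# implementations should produce the same (m, w, m) tensor.
let m = 8, w = 5, d = 2
    L, A, W, B = rand(m, w, m), rand(m, d, m), rand(w, d, d, w), rand(m, d, m)
    @assert BL_prop_right3(L, A, W, B) ≈ tensorops(L, A, W, B)
    @assert tensorops_slices(L, A, W, B) ≈ tensorops(L, A, W, B)
end
```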

I get for `m=256`:

```
440.183 ms (38 allocations: 450.00 MiB)  # BL_prop_right3
528.910 ms (122 allocations: 250.02 MiB) # tensorops_slices
523.354 ms (220 allocations: 50.03 MiB)  # tensorops
```
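
A large part of those allocations comes from the `permutedims` copies: `reshape` returns an array sharing the parent's memory, whereas `permutedims` allocates and copies. A minimal sketch of the difference:

```julia
x = rand(4, 6)
y = reshape(x, 2, 12)               # no copy: y shares x's memory
y[1] = 0.0
@assert x[1] == 0.0
p = permutedims(x, (2, 1))          # allocates a fresh 6x4 array
permutedims!(similar(p), x, (2, 1)) # or write into a preallocated buffer
```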

The TensorOperations versions are thus roughly 1.2x slower than the hand-written BLAS one.

Do you think I could make the BLAS function even faster?


Hi, I similarly experimented with tensor contractions. You might find the original thread I started interesting.

Michiel

On my machine the results are much closer:

```
julia> run_benchmarks(256)
191.363 ms (38 allocations: 450.00 MiB)
195.040 ms (122 allocations: 250.02 MiB)
205.940 ms (220 allocations: 50.03 MiB)

julia> run_benchmarks(128)
43.785 ms (38 allocations: 112.50 MiB)
46.633 ms (122 allocations: 62.52 MiB)
46.100 ms (220 allocations: 12.53 MiB)

```

Note that `@tensor` does a certain amount of multi-threading in Julia, which could perhaps be why you see differences. It also calls BLAS, which in my case is, I think, the built-in one. There is also a variant `@tensoropt` which takes information about which axes are long/short; docs here.
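
For concreteness, here is a hedged sketch of the cost-hint form of `@tensoropt` (sizes taken from `run_benchmarks` with `m = 256`; the `index => cost` syntax follows the TensorOperations docs):

```julia
# Tell the macro that s1, s2 are tiny (d = 2) and the rest are large,
# so it can pick a cheap pairwise contraction order at parse time.
@tensoropt (s1=>2, s2=>2, l2=>100, r2=>100, l1=>256, l3=>256, r1=>256, r3=>256) Lnew[r1, r2, r3] :=
    L[l1, l2, l3]*A[l1, s1, r1]*W[l2, s1, s2, r2]*B[l3, s2, r3]
```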

Thank you both for your help! I tried on my machine with 12 BLAS threads, and my results are now comparable to @mcabbott's, although I still get the BLAS version about 1.5-2x faster than the other ones.

After trying with `@views` and `@strided` in `permutedims`, I realized that the only way I could improve the timings was to preallocate aggressively:

```julia
using LinearAlgebra, TensorOperations, BenchmarkTools, Strided

function BL_prop_right3(L::Array{T, 3}, A::Array{T, 3},
                        W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    m = size(A, 1)
    d = size(A, 2)
    w = size(L, 2)

    A = reshape(A, m, d*m)
    L = reshape(L, m, w*m)
    T1 = BLAS.gemm('T', 'N', L, A)
    T1 = reshape(T1, w, m, d, m)
    T1 = permutedims(T1, (1, 3, 2, 4))
    T1 = reshape(T1, w*d, m^2)
    W = reshape(W, w*d, d*w)
    T2 = BLAS.gemm('T', 'N', T1, W)
    T2 = reshape(T2, m, m, d, w)
    T2 = permutedims(T2, (1, 3, 2, 4))
    T2 = reshape(T2, m*d, m*w)
    B = reshape(B, m*d, m)
    Lnew = BLAS.gemm('T', 'N', T2, B)
    Lnew = reshape(Lnew, m, w, m)
    return Lnew
end

"""
Use the preallocated arrays `T1`, `T2`, `T1p`, and `T2p` to avoid allocating
in `permutedims!` and `gemm!`. Also reuse `L` as output of the last `gemm!`.
"""
function tweak!(L::Array{T, 3}, A::Array{T, 3},
                W::Array{T, 4}, B::Array{T, 3},
                T1::Array{T, 2}, T2::Array{T, 2},
                T1p::Array{T, 4}, T2p::Array{T, 4}) where T<:Number
    m = size(A, 1)
    d = size(A, 2)
    w = size(L, 2)

    A = reshape(A, m, d*m)
    L = reshape(L, m, w*m)
    BLAS.gemm!('T', 'N', 1., L, A, 0., T1)
    T1 = reshape(T1, w, m, d, m)
    permutedims!(T1p, T1, (1, 3, 2, 4))
    T1p = reshape(T1p, w*d, m^2)
    W = reshape(W, w*d, d*w)
    BLAS.gemm!('T', 'N', 1., T1p, W, 0., T2)
    T2 = reshape(T2, m, m, d, w)
    permutedims!(T2p, T2, (1, 3, 2, 4))
    T2p = reshape(T2p, m*d, m*w)
    B = reshape(B, m*d, m)
    L = reshape(L, m*w, m)
    BLAS.gemm!('T', 'N', 1., T2p, B, 0., L)  # overwrite L with the result
    L = reshape(L, m, w, m)
    return L
end

function tensorops(L::Array{T, 3}, A::Array{T, 3},
                   W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    @tensor Lnew[r1, r2, r3] := (L[l1, l2, l3]*A[l1, s1, r1]
                                 *W[l2, s1, s2, r2]*B[l3, s2, r3])
    return Lnew
end

function run_benchmarks(m)
    w = 100
    d = 2
    L = rand(m, w, m)
    A = rand(m, d, m)
    W = rand(w, d, d, w)
    B = rand(m, d, m)
    # Final tensor's size: (m, w, m).

    T1 = zeros(w*m, d*m)
    T1p = zeros(w, d, m, m)
    T2 = zeros(m^2, d*w)
    T2p = zeros(m, d, m, w)

    L2 = tensorops_slices(L, A, W, B)  # defined in the first post
    L3 = BL_prop_right3(L, A, W, B)
    L4 = tensorops(L, A, W, B)
    L1 = tweak!(L, A, W, B, T1, T2, T1p, T2p)

    L = rand(m, w, m)  # tweak! overwrote L, so draw a fresh one
    L1p = zeros(m, w, d, m)
    L2p = zeros(m, d, w, m)

    @btime tweak!($L, $A, $W, $B, $T1, $T2, $T1p, $T2p)
    @btime BL_prop_right3($L, $A, $W, $B)
    @btime tensorops($L, $A, $W, $B)
    return
end
```

I get

```
> run_benchmarks(128)
12.687 ms (24 allocations: 1.27 KiB)     # tweak!
25.277 ms (38 allocations: 112.50 MiB)   # BL_prop_right3
42.893 ms (210 allocations: 12.52 MiB)   # tensorops

> run_benchmarks(256)
64.850 ms (24 allocations: 1.27 KiB)     # tweak!
132.445 ms (38 allocations: 450.00 MiB)  # BL_prop_right3
158.259 ms (210 allocations: 50.02 MiB)  # tensorops

> run_benchmarks(512)
357.210 ms (24 allocations: 1.27 KiB)    # tweak!
598.253 ms (38 allocations: 1.76 GiB)    # BL_prop_right3
863.981 ms (210 allocations: 200.02 MiB) # tensorops
```
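
One incidental note on the `gemm!` calls in `tweak!`: the literals `1.` and `0.` are `Float64`, so despite the generic `T<:Number` signature the function currently only really works for `Float64` arrays. A hedged fix is to match the scalars to the element type:

```julia
# BLAS.gemm!(tA, tB, alpha, A, B, beta, C) with type-generic scalars,
# so the call also works for, e.g., Float32 arrays:
BLAS.gemm!('T', 'N', one(T), L, A, zero(T), T1)
```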

I could have avoided preallocating one of the two rank-4 arrays if `permutedims!` could work in place, but that doesn't seem possible.

Could you report timings if you modify your `run_benchmarks` function so as to either run the `tensorops` benchmark first, or to reset `L` after `@btime tweak!`? The latter call mutates `L` on every sample, which typically blows up its entries until they are huge, and this can actually affect timings.

Also some info on your Julia version might be useful. You could also experiment with setting `JULIA_NUM_THREADS > 1`.
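
One way to do that reset inside the benchmark itself is BenchmarkTools' `setup` keyword with `evals=1`, so every sample sees a fresh `L` (a sketch, assuming `L0` holds the pristine tensor):

```julia
# Rebuild the work tensor before each sample, so tweak! never operates
# on its own previous (blown-up) output.
@btime tweak!(Lwork, $A, $W, $B, $T1, $T2, $T1p, $T2p) setup=(Lwork = copy($L0)) evals=1
```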

I reset the `L` tensor before each operation and set a high number of threads.

```julia
using LinearAlgebra, TensorOperations, BenchmarkTools, Strided

function BL_prop_right3(L::Array{T, 3}, A::Array{T, 3},
                        W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    m = size(A, 1)
    d = size(A, 2)
    w = size(L, 2)

    A = reshape(A, m, d*m)
    L = reshape(L, m, w*m)
    T1 = BLAS.gemm('T', 'N', L, A)
    T1 = reshape(T1, w, m, d, m)
    T1 = permutedims(T1, (1, 3, 2, 4))
    T1 = reshape(T1, w*d, m^2)
    W = reshape(W, w*d, d*w)
    T2 = BLAS.gemm('T', 'N', T1, W)
    T2 = reshape(T2, m, m, d, w)
    T2 = permutedims(T2, (1, 3, 2, 4))
    T2 = reshape(T2, m*d, m*w)
    B = reshape(B, m*d, m)
    Lnew = BLAS.gemm('T', 'N', T2, B)
    Lnew = reshape(Lnew, m, w, m)
    return Lnew
end

function BL_prop_right3_prealloc!(L::Array{T, 3}, A::Array{T, 3},
                                  W::Array{T, 4}, B::Array{T, 3},
                                  T1::Array{T, 2}, T2::Array{T, 2},
                                  T1p::Array{T, 4}, T2p::Array{T, 4}) where T<:Number
    m = size(A, 1)
    d = size(A, 2)
    w = size(L, 2)

    A = reshape(A, m, d*m)
    L = reshape(L, m, w*m)
    BLAS.gemm!('T', 'N', 1., L, A, 0., T1)
    T1 = reshape(T1, w, m, d, m)
    permutedims!(T1p, T1, (1, 3, 2, 4))
    T1p = reshape(T1p, w*d, m^2)
    W = reshape(W, w*d, d*w)
    BLAS.gemm!('T', 'N', 1., T1p, W, 0., T2)
    T2 = reshape(T2, m, m, d, w)
    permutedims!(T2p, T2, (1, 3, 2, 4))
    T2p = reshape(T2p, m*d, m*w)
    B = reshape(B, m*d, m)
    L = reshape(L, m*w, m)
    BLAS.gemm!('T', 'N', 1., T2p, B, 0., L)
    L = reshape(L, m, w, m)
    return L
end

function tensor!(L::Array{T, 3}, A::Array{T, 3},
                 W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    @tensor L[r1, r2, r3] = (L[l1, l2, l3]*A[l1, s1, r1]
                             *W[l2, s1, s2, r2]*B[l3, s2, r3])
    return L
end

function tensoropt!(L::Array{T, 3}, A::Array{T, 3},
                    W::Array{T, 4}, B::Array{T, 3}) where T<:Number
    @tensoropt L[r1, r2, r3] = (L[l1, l2, l3]*A[l1, s1, r1]
                                *W[l2, s1, s2, r2]*B[l3, s2, r3])
    return L
end

function tensor_slices!(L::Array{T, 3}, A::Array{T, 3},
                        W::Array{T, 4}, B::Array{T, 3},
                        L1::Array{T, 4}, L2::Array{T, 4}) where T<:Number
    @tensor begin
        L1[l1, l2, s2, c] = L[l1, l2, l3]*B[l3, s2, c]
        L2[l1, s1, b, c] = L1[l1, l2, s2, c]*W[l2, s1, s2, b]
        L[a, b, c] = L2[l1, s1, b, c]*A[l1, s1, a]
    end
    return L
end

function run_benchmarks(m)
    w = 100
    d = 2
    L = rand(m, w, m)
    A = rand(m, d, m)
    W = rand(w, d, d, w)
    B = rand(m, d, m)
    # Final tensor's size: (m, w, m).

    # Store original L in a separate copy that won't be modified.
    L0 = deepcopy(L)

    # Arrays for preallocation purposes.
    T1 = zeros(w*m, d*m)
    T1p = zeros(w, d, m, m)
    T2 = zeros(m^2, d*w)
    T2p = zeros(m, d, m, w)
    L1p = zeros(m, w, d, m)
    L2p = zeros(m, d, w, m)

    L = deepcopy(L0)
    @btime BL_prop_right3_prealloc!($L, $A, $W, $B, $T1, $T2, $T1p, $T2p)
    L = deepcopy(L0)
    @btime BL_prop_right3($L, $A, $W, $B)
    L = deepcopy(L0)
    @btime tensor!($L, $A, $W, $B)
    L = deepcopy(L0)
    @btime tensoropt!($L, $A, $W, $B)
    L = deepcopy(L0)
    @btime tensor_slices!($L, $A, $W, $B, $L1p, $L2p)

    return
end

show(versioninfo())

run_benchmarks(128)

run_benchmarks(256)

run_benchmarks(512)
```

The results were

```
12
12
Julia Version 1.1.0
Commit 80516ca202* (2019-01-21 21:24 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
nothing

# m = 128.

12.850 ms (24 allocations: 1.27 KiB) # BL_prop_right3_prealloc!
18.949 ms (38 allocations: 112.50 MiB) # BL_prop_right3
28.091 ms (218 allocations: 26.55 KiB) # tensor!
41.069 ms (218 allocations: 26.55 KiB) # tensoropt!
34.248 ms (114 allocations: 15.98 KiB) # tensor_slices!

# m = 256.

61.937 ms (24 allocations: 1.27 KiB) # BL_prop_right3_prealloc!
135.208 ms (38 allocations: 450.00 MiB) # BL_prop_right3
124.376 ms (218 allocations: 26.55 KiB) # tensor!
110.004 ms (218 allocations: 26.55 KiB) # tensoropt!
111.301 ms (114 allocations: 15.98 KiB) # tensor_slices!

# m = 512.

352.274 ms (24 allocations: 1.27 KiB) # BL_prop_right3_prealloc!
607.429 ms (38 allocations: 1.76 GiB) # BL_prop_right3
743.994 ms (218 allocations: 26.55 KiB) # tensor!
627.976 ms (218 allocations: 26.55 KiB) # tensoropt!
566.558 ms (114 allocations: 15.98 KiB) # tensor_slices!

```

So in fact `@tensoropt` beat my original implementation at `m=256` without preallocation, and it beats `@tensor` when `m` is larger. However, the runtimes are on the order of milliseconds, so there could be some overhead due to the computation of the contraction order, especially when compared with `tensor_slices!`, where the "correct" contraction order is already imposed. Also note that in the functions using TensorOperations.jl I am also using preallocation.

I built Julia from source using Intel MKL.

I don't fully understand the reasoning behind `tensor_slices!`, in that it does the operations in a different order than your handwritten version. The way to write `tensor_slices!` such that it mimics your handwritten version is as follows:

```julia
function tensor_slices!(L::Array{T, 3}, A::Array{T, 3},
                        W::Array{T, 4}, B::Array{T, 3},
                        L1::Array{T, 4}, L2::Array{T, 4},
                        L1p::Array{T, 4}, L2p::Array{T, 4}) where T<:Number
    @tensor begin
        L1[l2, l3, s1, r1] = L[l1, l2, l3]*A[l1, s1, r1]
        L1p[l2, s1, l3, r1] = L1[l2, l3, s1, r1]
        L2[l3, r1, s2, r2] = L1p[l2, s1, l3, r1]*W[l2, s1, s2, r2]
        L2p[l3, s2, r1, r2] = L2[l3, r1, s2, r2]
        L[r1, r2, r3] = L2p[l3, s2, r1, r2]*B[l3, s2, r3]
    end
    return L
end
```

which then takes temporary arrays of size

```julia
L1  = zeros(w, m, d, m)
L1p = zeros(w, d, m, m)
L2  = zeros(m, m, d, w)
L2p = zeros(m, d, m, w)
```

With this, I get, using a single Julia thread

```
julia> versioninfo()
Julia Version 1.1.1
Commit 55e36cc308 (2019-05-16 04:10 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
```

the following (only comparing `BL_prop_right3_prealloc!` and the new `tensor_slices!`)

```
julia> run_benchmarks(128); run_benchmarks(256); run_benchmarks(512)
13.801 ms (24 allocations: 1.27 KiB)
14.198 ms (22 allocations: 1.75 KiB)
67.250 ms (24 allocations: 1.27 KiB)
70.928 ms (22 allocations: 1.75 KiB)
362.352 ms (24 allocations: 1.27 KiB)
371.626 ms (22 allocations: 1.75 KiB)
```

So these timings are comparable, as they should be, even though TensorOperations.jl uses a different algorithm for `permutedims!`. Typically it is more efficient, but in this case, where the permutations have `1` as their first entry, i.e. source and destination share the first dimension, the one from Base seems a bit faster (profiling shows that this is where the difference is, and that the permutations take up most of the time, at least for m=128 and m=256). Now, the advantage of the algorithm in TensorOperations.jl/Strided.jl is that it supports multithreading, though it probably does not work well with too many threads. So using `JULIA_NUM_THREADS=4` instead, I obtain (now enabling all benchmarks):

```
julia> run_benchmarks(128); println("---------"); run_benchmarks(256); println("---------"); run_benchmarks(512)
14.024 ms (24 allocations: 1.27 KiB)
16.084 ms (38 allocations: 112.50 MiB)
12.400 ms (188 allocations: 13.50 KiB)
11.805 ms (188 allocations: 13.50 KiB)
9.677 ms (36 allocations: 3.56 KiB)
---------
68.877 ms (24 allocations: 1.27 KiB)
135.856 ms (38 allocations: 450.00 MiB)
63.799 ms (188 allocations: 13.50 KiB)
58.458 ms (188 allocations: 13.50 KiB)
55.489 ms (36 allocations: 3.56 KiB)
---------
366.578 ms (24 allocations: 1.27 KiB)
615.247 ms (38 allocations: 1.76 GiB)
364.872 ms (188 allocations: 13.50 KiB)
335.230 ms (188 allocations: 13.50 KiB)
323.419 ms (36 allocations: 3.56 KiB)
```

This seems quite reasonable to me.
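
As an aside, Strided.jl's kernel can also be invoked directly from the hand-written pipeline via its `@strided` macro; a hedged sketch (not benchmarked in this thread):

```julia
using Strided

# Route the permutation through Strided.jl's multithreaded, cache-blocked
# kernel instead of Base's permutedims! (assumes T1p and T1 as in tweak!):
@strided permutedims!(T1p, T1, (1, 3, 2, 4))
```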

I did notice you had `JULIA_NUM_THREADS = 12` on a CPU that has only 8 physical cores. I think that is asking for trouble (even though it officially exposes 16 hardware threads, I would not count on hyperthreading).
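
For reference, a minimal sketch of matching both thread pools to the physical core count (8 here); note that `JULIA_NUM_THREADS` must be set before Julia starts:

```julia
# In the shell, before launching Julia:
#   JULIA_NUM_THREADS=8 julia
using LinearAlgebra
BLAS.set_num_threads(8)  # threads used by gemm/gemm!
Threads.nthreads()       # Julia threads used by TensorOperations/Strided
```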


Wow, nice! So indeed the contractions are faster when I use TensorOperations.jl with multithreading than with the pure BLAS implementation plus `Base.permutedims`.

Something I just realized is that the intermediate tensors `L1` and `L2`, or `T1` and `T2`, have equal lengths, so in some functions I can avoid at least two preallocations by reusing them, but that is specific to my problem.
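
Concretely, a sketch of that reuse under the shapes from `run_benchmarks` (my illustration): all four temporaries have length `w*d*m^2`, so two flat buffers suffice, each stage writing into the buffer whose previous contents have already been consumed.

```julia
buf1 = Vector{Float64}(undef, w*d*m^2)
buf2 = Vector{Float64}(undef, w*d*m^2)
T1  = reshape(buf1, w*m, d*m)    # first gemm! output
T1p = reshape(buf2, w, d, m, m)  # first permutedims! destination
T2  = reshape(buf1, m^2, d*w)    # safe: T1 is dead once T1p is written
T2p = reshape(buf2, m, d, m, w)  # safe: T1p is dead once T2 is written
```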

You are right about the order of the contractions in `tensor_slices`; I had inverted it, although the output tensor is the same as in the other functions. The tensor dimensions, however, are symmetric under the contraction order, so this shouldn't affect timings.

So to sum up:

• The only way to improve the performance of my original `BL_prop_right3` using only BLAS and Base is to keep a cache of preallocated intermediate arrays; beyond that, there is nothing else I can do.

• TensorOperations.jl uses multithreading efficiently to speed up the permutations, and thus makes the whole tensor contraction faster.

So I will use your library then.