# Is there any way to optimize array additions and multiplications with transposes?

Yes. That I tried on the #17th reply, which appears in `test3(A) = 0.1 .* A .+ 0.2 .* PermutedDimsArray(A, (2,1,3,4))` in 14th reply. I would like to try the performance from `@tullio` and `@tensor`, but did not figure out an elegant way to do so. (`@tullio` crashed)

Oh right, sorry, that’s very similar.

Not sure exactly what the error is, but note that it’s not from Tullio, it’s from LoopVectorization. Tullio alone seems to run, but calling `+(x...)` with > 700 arguments is asking for problems – many things handling tuples stop being efficient at 32.

The whole thing seems a bit X-Y problem though. Why do you want to add all the permutations of a 6D-array? If you must, it’s linear, so you can just save the coefficients to disk and do it as one matrix-vector multiplication; at size `n=3` that’s 10^5 times faster.

2 Likes

I see. After removed `LoopVectorization`, `Tullio` takes forever…

I thought about defining a bigger array, take 4-dimensional case for example, `A[i,:,:,:,:] `. For each `i`, it corresponds to the `i`th permutation. Thus, the problem can be converted into
`A[i,:,:,:,:]* P[i] = B[:,:,:,:]` , where `P[i]` is the array of factors 0.1, 0.2, … corresponds to each permutation, as kind of matrix-vector multiplication. (sum over `i`)

I tried Fortran and Python. The thing is,the matrix-vector multiplication is faster, but building the bigger array takes longer time. In total, the time is similar to the direct summation over permutation. In `Julia`, I met some issues. Unable to do so. Maybe there is some trick to make it faster. I really want to see 10^5 faster

Here is my Python code

``````import numpy as np
import time
import itertools as it
import opt_einsum as oe

ref_list = [0, 1, 2, 3]
p = it.permutations(ref_list)
transpose_list = tuple(p)

n_loop = 2
na = nb = nc = nd = 30

A = np.random.random((na,nb,nc,nd))

factor_list = [(i+1)*0.1 for i in range(24)]
time_total = 0.0

for n in range(n_loop):
sum_A = np.zeros((na,nb,nc,nd))
start_0 = time.time()
for m, t in enumerate(transpose_list):
if abs(factor_list[m]) < 1.e-3:
continue
np.add(sum_A, factor_list[m]  * np.transpose(A, transpose_list[m] ), out = sum_A)

finish_0 = time.time()
time_total += finish_0 - start_0

n_factor = 24
total_array = np.zeros((n_factor,na,nb,nc,nd))
factor_array = np.asarray(factor_list)

time_total = 0.0
for n in range(n_loop):
start_0 = time.time()
for m in range(n_factor):
total_array[m,:,:,:,:] = np.transpose(A, transpose_list[m] )

factor_array = np.asarray(factor_list)

oe.contract('nijkl,n->ijkl', total_array, factor_array)
finish_0 = time.time()
time_total += finish_0 - start_0

print('time for einsum',time_total/n_loop)

``````

I got

``````time for permu addition 0.10860311985015869
time for einsum 0.14762508869171143
``````

Here is my Julia code

``````using BenchmarkTools, Tullio, TensorOperations, Combinatorics, LoopVectorization

function perms(a)
B = collect(permutations(a))
B
end

function einsum(n)
@tullio C[i,j,k,l] := factor_p[n]  * A[n, i, j, k, l]
end

thresh = 0.0001
P = perms([1,2,3,4])
factor_p = [0.1:0.1:2.4;]
n = 30
A = rand(n, n, n, n)

n_factor = 24
factor_p = [0.1:0.1:2.4;]

total_A = zeros(n_factor, n, n, n, n)
for i = 1:n_factor
@tullio total_A[i,j,k,l,m] = PermutedDimsArray(A, P[i])[j,k,l,m]
end

@btime einsum(n_factor)
``````

I got

``````ERROR: LoadError: "expected a 5-array A"
Stacktrace:
 macro expansion
@ ~/.julia/packages/Tullio/wAFFh/src/macro.jl:983 [inlined]
...
``````

Sorry for being late to the party, but if you write it using views, broadcasting and `PermutedDimsArray` (or just `permutedims`), but put `@strided` from Strided.jl in front, you should get a decent speed up. Strided.jl is what speeds up the permutations in TensorOperations.jl, but can be used in itself using the `@strided` macro, which should combine well with most broadcasting expressions.

Furthermore, `Strided.jl` supports multithreading, so if you launch Julia with multiple threads, you could get some further speedup (although in the end this kind of operation is bandwidth limited rather than compute limited).

Thanks. I tried a bit.

using BenchmarkTools, Tullio, TensorOperations, Combinatorics, Strided

function perms(a)
B = collect(permutations(a))
B
end

sum_4 = zeros(n, n, n, n)
for i = 1:24
sum_4 = sum_4 .+ factor_p[i] .* PermutedDimsArray(A, P[i])
end
end

sum_4 = zeros(n, n, n, n)
for i = 1:24
@strided sum_4 = sum_4 .+ factor_p[i] .* PermutedDimsArray(A, P[i])
end
end

P = perms([1,2,3,4])
factor_p = [0.1:0.1:2.4;]
n = 30
A = rand(n, n, n, n)

I got

75.302 ms (315 allocations: 154.51 MiB)
81.919 ms (531 allocations: 154.53 MiB)

by julia 1.7.3
not much improvement Note that you have many allocations in both functions. The reason is that, in your loop, you are not storing the result in `sum4` but allocating a new `sum4` array every time, because you forgot a dot before the equal sign. Without this, broadcasting is used for the right hand side, but it is then materialized into a new array, which is then called `sum4`. So I guess you want

``````function perm_add_4(n)
sum_4 = zeros(n, n, n, n)
for i = 1:24
sum_4 .= sum_4 .+ factor_p[i] .* PermutedDimsArray(A, P[i])
end
end
sum_4 = zeros(n, n, n, n)
for i = 1:24
@strided sum_4 .= sum_4 .+ factor_p[i] .* PermutedDimsArray(A, P[i])
end
end
``````

With these, I get, on my computer, using `julia -t 4` to use 4 threads:

``````julia> @btime perm_add_4(n)
33.952 ms (243 allocations: 6.19 MiB)

90.352 ms (2173 allocations: 154.67 MiB)
``````

Apparently there is something wrong with how I deal with `PermutedDimsArray`, if you make it within the `@strided` macro. There are two solutions, namely to just use `permutedims` within the macro call, or to make the `PermutedDimsArray` before the `@strided` call. (I will of course also try to fix this).

``````function perm_add_4_stride_1(n)
sum_4 = zeros(n, n, n, n)
for i = 1:24
B = PermutedDimsArray(A, P[i])
@strided sum_4 .= sum_4 .+ factor_p[i] .* B
end
end
sum_4 = zeros(n, n, n, n)
for i = 1:24
@strided sum_4 .= sum_4 .+ factor_p[i] .* permutedims(A, P[i])
end
end
``````

With these, I find

``````julia> @btime perm_add_4_stride_1(n)
19.000 ms (2086 allocations: 6.38 MiB)

18.646 ms (2062 allocations: 6.38 MiB)
``````

so much better.

Finally, you can speed it up further by ‘unrolling’ the loop, i.e. doing multiple additions at once, so that you do not have to run over the data of `sum4` 24 times:

``````function perm_add_4_stride_unrolled(n)
sum_4 = zeros(n, n, n, n)
for i = 1:4:24
@strided sum_4 .= sum_4 .+ factor_p[i] .* permutedims(A, P[i]) .+ factor_p[i+1] .* permutedims(A, P[i+1]) .+ factor_p[i+2] .* permutedims(A, P[i+2]) .+ factor_p[i+3] .* permutedims(A, P[i+3])
end
end
``````

which gives

``````julia> @btime perm_add_4_stride_unrolled(n)
10.937 ms (751 allocations: 6.28 MiB)
``````
1 Like

In addition to other problems, you are accessing non-`const` global variables inside your functions. This is very bad for performance, and is the very first performance tip: Performance Tips · The Julia Language

`A`, `P` and `factor_p` should be input arguments to your functions.

1 Like

Thanks @DNF, these were indeed some extra comments I wanted to add (but forgot) at the end of my post. To further improve performance, now it is indeed down to the general performance tips in terms of avoiding globals etc.

1 Like