Is there any way to optimize array additions and multiplications with transposes?

Yes. That I tried on the #17th reply, which appears in test3(A) = 0.1 .* A .+ 0.2 .* PermutedDimsArray(A, (2,1,3,4)) in 14th reply. I would like to try the performance from @tullio and @tensor, but did not figure out an elegant way to do so. (@tullio crashed)

Oh right, sorry, that’s very similar.

Not sure exactly what the error is, but note that it’s not from Tullio, it’s from LoopVectorization. Tullio alone seems to run, but calling +(x...) with > 700 arguments is asking for problems – many things handling tuples stop being efficient at 32.

The whole thing seems a bit X-Y problem though. Why do you want to add all the permutations of a 6D-array? If you must, it’s linear, so you can just save the coefficients to disk and do it as one matrix-vector multiplication; at size n=3 that’s 10^5 times faster.

2 Likes

I see. After removed LoopVectorization, Tullio takes forever…

I thought about defining a bigger array, take 4-dimensional case for example, A[i,:,:,:,:] . For each i, it corresponds to the ith permutation. Thus, the problem can be converted into
A[i,:,:,:,:]* P[i] = B[:,:,:,:] , where P[i] is the array of factors 0.1, 0.2, … corresponds to each permutation, as kind of matrix-vector multiplication. (sum over i)

I tried Fortran and Python. The thing is,the matrix-vector multiplication is faster, but building the bigger array takes longer time. In total, the time is similar to the direct summation over permutation. In Julia, I met some issues. Unable to do so. Maybe there is some trick to make it faster. I really want to see 10^5 faster

Here is my Python code

import numpy as np
import time
import itertools as it
import opt_einsum as oe

ref_list = [0, 1, 2, 3]
p = it.permutations(ref_list)
transpose_list = tuple(p)


n_loop = 2
na = nb = nc = nd = 30

A = np.random.random((na,nb,nc,nd))

factor_list = [(i+1)*0.1 for i in range(24)]
time_total = 0.0

for n in range(n_loop):
    sum_A = np.zeros((na,nb,nc,nd))
    start_0 = time.time()
    for m, t in enumerate(transpose_list):
       if abs(factor_list[m]) < 1.e-3:
           continue 
       np.add(sum_A, factor_list[m]  * np.transpose(A, transpose_list[m] ), out = sum_A) 

    finish_0 = time.time()
    time_total += finish_0 - start_0

print('time for permu addition', time_total/n_loop) 


n_factor = 24
total_array = np.zeros((n_factor,na,nb,nc,nd))
factor_array = np.asarray(factor_list)


time_total = 0.0
for n in range(n_loop):
  start_0 = time.time()
  for m in range(n_factor):
    total_array[m,:,:,:,:] = np.transpose(A, transpose_list[m] )

  factor_array = np.asarray(factor_list)

  oe.contract('nijkl,n->ijkl', total_array, factor_array)
  finish_0 = time.time()    
  time_total += finish_0 - start_0

print('time for einsum',time_total/n_loop)

I got

time for permu addition 0.10860311985015869
time for einsum 0.14762508869171143

Here is my Julia code

using BenchmarkTools, Tullio, TensorOperations, Combinatorics, LoopVectorization

function perms(a) 
    B = collect(permutations(a))
    B
end

function einsum(n)
    @tullio C[i,j,k,l] := factor_p[n]  * A[n, i, j, k, l]  
end    


thresh = 0.0001
P = perms([1,2,3,4])
factor_p = [0.1:0.1:2.4;]
n = 30
A = rand(n, n, n, n)

n_factor = 24
factor_p = [0.1:0.1:2.4;]


total_A = zeros(n_factor, n, n, n, n)
for i = 1:n_factor
    @tullio total_A[i,j,k,l,m] = PermutedDimsArray(A, P[i])[j,k,l,m]
end    

@btime einsum(n_factor)

I got

ERROR: LoadError: "expected a 5-array A"
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Tullio/wAFFh/src/macro.jl:983 [inlined]
...