Is there any way to optimize array additions and multiplications with transposes?

that’s OK. I can try myself and see how it works. If I met some difficulty, I will ask.

2 Likes

Good, at least (having learned from StackOverflow that your matrices are pretty big) we can give you some obligatory advice: your kernel probably should look something like

@btime (B .= 0.1.*(A .+ transpose(A))) setup=(A=rand(10_000, 10_000); B=similar(A))

yielding

  1.952 s (0 allocations: 0 bytes)

that way avoiding allocation of intermediate results (via broadcasting) and pre-allocating the result B. Recommended way to measure is using BenchmarkTools’s @btime or @benchmark (avoid using global non constants there). If you are seeing unreasonable amounts of allocations when testing, better call us;) Later we could try to parallelize this one (although I’d somehow suspect the task is memory bound).

Edit: stupid me: broadcasting is needed, not @views.
Another edit: and obviously I didn’t try LoopVectorization and Tullio yet.

Thanks a lot. I did a preliminary test with a rank-4 array. I apologize that I changed a bit from

For example, A is a 2-dimensional array, one target is (in terms of numpy as np) 0.1A + 0.1np.transpose(A,(1,0))

since ultimately I would like to calculate higher-rank arrays with different factors in multiplications, similar to the link in the original question. Not sure if this is the best approach, that I gradually expand my test from the simplest.

The results from Julia and Numpy are comparable :frowning:

Here is my Julia Code

using BenchmarkTools, Tullio 

n = 30
A = rand(n, n, n, n)
B = rand(n, n, n, n)

@btime (@tullio C[a,b,c,d] := A[b,a,c,d])
@tullio C[a,b,c,d] := A[b,a,c,d]
@btime 0.1*A + 0.2*C

My Python code

import numpy as np
import time

n0=n1=n2=n3=30
M = np.random.random((n0,n1,n2,n3))

n_loop = 100

start = time.time()
for i in range(n_loop):
    M2 = np.add( 0.1*M, 0.2*np.transpose(M,(1,0,2,3)) )
end = time.time()
print('time M2 transpose', (end - start)/n_loop)

The results from Julia is

  1.455 ms (3 allocations: 6.18 MiB)
  3.179 ms (9 allocations: 18.54 MiB)

from Python

time M2 transpose 0.00432762861251831

The sum of the two steps from Julia is similar to Python. Maybe there are tricks in Julia I simply don’t know :frowning:

1 Like

I checked the following Python script

import numpy as np
import time

n0=n1=n2=n3=30
A = np.random.random((n0,n1,n2,n3))

n_loop = 100

start = time.time()
for i in range(n_loop):
    B = np.add( 0.1*A, 0.2*np.transpose(A,(1,0,2,3)) )
end = time.time()
print((end - start)/n_loop * 1000, " ms")

against Julia’s

using BenchmarkTools, LoopVectorization, Tullio

function test1(A)
    B = similar(A)
    for i in 1:size(A,1), j in 1:size(A,2)
        # B[i,j,:,:] = 0.1*A[i,j,:,:] + 0.2*A[i,j,:,:]'
        B[:,:,i,j] = 0.1*A[:,:,i,j] + 0.2*A[:,:,i,j]'
    end
    B
end    

function test2(A)
    B = similar(A)
    for i in 1:size(A,1), j in 1:size(A,2)
        # @. @views B[i,j,:,:] = 0.1*A[i,j,:,:] + 0.2*A[i,j,:,:]'
        @. @views B[:,:,i,j] = 0.1*A[:,:,i,j] + 0.2*A[:,:,i,j]'
    end
    B
end    

function test3(A)
    # @tullio B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[i,j,l,k]
    @tullio B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[j,i,k,l]
    B
end    

@btime test1(A) setup=(n=30; A=rand(n,n,n,n))
@btime test2(A) setup=(n=30; A=rand(n,n,n,n))
@btime test3(A) setup=(n=30; A=rand(n,n,n,n))
nothing

Python yields

4.956309795379639  ms

Julia yields

  7.886 ms (5403 allocations: 37.81 MiB) # naive
  2.253 ms (3 allocations: 6.18 MiB) # naive, @views, broadcasting
  1.323 ms (44 allocations: 6.18 MiB) # @tullio

Observations:

  • for the naive versions ordering of memory access is relevant
  • I didn’t manage to test a version with LoopVectorization’s @turbo

I think nobody has given the most obvious translation, which uses permutedims (eager) or PermutedDimsArray (lazy) in place of np.transpose. For me this is faster than the other Base solutions.

TensorOperations.jl is often the fastest way to do permutedims on larger arrays. It has a smarter cache-friendly blocking algorithm than the one in Base. But how much this matters of course depends on size & permutation.

test4(A) = 0.1 .* A .+ 0.2 .* PermutedDimsArray(A, (2,1,3,4))

using Tullio, TensorOperations
test5(A) = @tullio B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[j,i,k,l]  avx=false  # no LV
test6(A) = @tensor B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[j,i,k,l]
3 Likes

Very instructive! Thanks, here are the numbers for

using BenchmarkTools, LoopVectorization, Tullio, TensorOperations

function test1(A)
    B = similar(A)
    for i in 1:size(A,1), j in 1:size(A,2)
        B[:,:,i,j] = 0.1*A[:,:,i,j] + 0.2*A[:,:,i,j]'
    end
    B
end    

function test2(A)
    B = similar(A)
    for i in 1:size(A,1), j in 1:size(A,2)
        @. @views B[:,:,i,j] = 0.1*A[:,:,i,j] + 0.2*A[:,:,i,j]'
    end
    B
end    

test3(A) = 0.1 .* A .+ 0.2 .* PermutedDimsArray(A, (2,1,3,4))
test4(A) = @tullio B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[j,i,k,l]
test5(A) = @tullio B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[j,i,k,l]  avx=false  # no LV
test6(A) = @tensor B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[j,i,k,l]

@btime test1(A) setup=(n=30; A=rand(n,n,n,n))
@btime test2(A) setup=(n=30; A=rand(n,n,n,n))
@btime test3(A) setup=(n=30; A=rand(n,n,n,n))
@btime test4(A) setup=(n=30; A=rand(n,n,n,n))
@btime test5(A) setup=(n=30; A=rand(n,n,n,n))
@btime test6(A) setup=(n=30; A=rand(n,n,n,n))
nothing

from my system for comparison

  7.980 ms (5403 allocations: 37.81 MiB) # naive
  2.280 ms (3 allocations: 6.18 MiB) # naive, @views, broadcasting
  1.894 ms (12 allocations: 6.18 MiB) # permuted dims
  1.341 ms (44 allocations: 6.18 MiB) # @tullio
  1.365 ms (44 allocations: 6.18 MiB) # @tullio, no LV
  1.096 ms (131 allocations: 6.19 MiB) # @tensor
1 Like

In my system, it is

  8.389 ms (5403 allocations: 37.81 MiB)
  1.664 ms (3 allocations: 6.18 MiB)
  1.669 ms (12 allocations: 6.18 MiB)
  1.293 ms (3 allocations: 6.18 MiB)
  1.588 ms (3 allocations: 6.18 MiB)
  2.585 ms (21 allocations: 6.18 MiB)

approach 4 is the fastest. And faster than python by ~ a factor of 3.

I noticed I should use -O3 in Fotran test. The code and results are

    
  
  Program transpose
  
    integer, parameter :: dp = selected_real_kind(15, 307)

    real(dp), dimension(:, :, :, :), allocatable :: a, b
    Integer :: n1, n2, n3, n4, n, m_iter
    Integer :: l1, l2, l3, l4  
    Integer :: start, finish, rate
    real(dp) :: sum_time
    
    Write(*, *) 'n1, n2, n3, n4?'
    Read(*, *) n1, n2, n3, n4

    Allocate( a ( 1:n1, 1:n2, 1:n3, 1:n4 ) )
    Allocate( b ( 1:n1, 1:n2, 1:n3, 1:n4 ) )
    
    Call Random_number( a )
    
    m_iter = 100

    call system_clock( start, rate ) 
    do n = 1, m_iter  
      b = 0.1*a + 0.2*reshape(a, (/n1, n2, n3, n4/), order = (/2,1,3,4/) ) 
    end do
    call system_clock( finish, rate )
    sum_time = real( finish - start, dp ) / rate  

    write (*,*) 'reshape time', sum_time/m_iter 

    
    call system_clock( start, rate )
    do n = 1, m_iter  
      do l2 = 1, n2
        do l1 = 1, n1
          b(l1,l2,:,:) = 0.1*a(l1,l2,:,:) + 0.2*a(l2,l1,:,:)
        end do
      end do        
    end do
    call system_clock( finish, rate )
    sum_time =  real( finish - start, dp ) / rate  

    write (*,*) 'reduced loop time', sum_time/m_iter

    Call system_clock( start, rate )
    do n = 1, m_iter  
      do l4 = 1, n4
        do l3 = 1, n3     
          do l2 = 1, n2
            do l1 = 1, n1
              b(l1,l2,l3,l4) = 0.1*a(l1,l2,l3,l4) + 0.2*a(l2,l1,l3,l4)
            end do
          end do                    
        end do
      end do        
    end do
    Call system_clock( finish, rate )
    sum_time =  real( finish - start, dp ) / rate  

    write (*,*) 'all loop', sum_time/m_iter 

  End 

by gfortran -O3 transpose.f90

 n1, n2, n3, n4?
30 30 30 30
 reshape time   3.3300000000000001E-003
 reduced loop time   1.0620000000000001E-002
 all loop   1.7000000000000001E-003

If I use loops explicily in Fortran without any fancy trick in transpose, Julia is a bit faster.

1 Like

Sorry there is one more question towards systematic test for all permutations.

Suppose I would like to compute
0.1 A[i,j,k,l] + 0.2 * A[i,j,l,k] + 0.3 * A[i,k,i,j] + 0.4 * A[i,k,j,i] + 0.5 * A[i,l,j,k]...
Is there any simple way to permutations, than specifying [i,k,j,i] etc (I can try to generate strings as the indices, but it may be a bit complicted)?
I used Get all Permutations of Array?

using BenchmarkTools, Combinatorics

function perms(a) 
    B = reverse(collect(permutations(a)))
    B
end

n=30
A=rand(Float64,(n,n,n,n))
@btime p = perms(A)

Got ERROR: LoadError: OverflowError: 531439031701620000 * 809997 overflowed for type Int64

And, if there is a prefactor equals zero, e.g., 0.0* A[i,l,k,j], can I skip the permutation to i,l,k,j? The permutations will generate all permutations as I understand it (In Python, I can use
np.transpose(A,(0,3,2,1)) to specify the given transpose.)

Edited: found PermutedDimsArray(A, (2,1,3,4)) in test3 is similar to Python transpose

1 Like

This has been implicitly addressed, but I’d like to point out that this allocates two intermediate arrays plus one output array. With broadcasting you can fuse the operations to avoid the intermediate arrays, like this

0.1 .*A .+ 0.2 .* C

Or avoid any allocation by reusing C:

C .= 0.1 .*A .+ 0.2 .* C

Also, when benchmarking, interpolate non-const globals:

@btime 0.1 .*$A .+ 0.2 .* $C
2 Likes

I extened my Julia code to include more permutations, closer to the objective in the original stackoverflow question.

Here is the Julia code

using BenchmarkTools, Tullio, TensorOperations, Combinatorics, LoopVectorization

function perms(a) 
    B = collect(permutations(a))
    B
end


function perm_add_4(n)
    sum_4 = zeros(n, n, n, n)
    for i = 1:24
        sum_4 = sum_4  .+  factor_p[i] .* PermutedDimsArray(A, P[i])
    end     
end    

function perm_add_5(n)
    sum_5 = zeros(n, n, n, n, n)
    for i = 1:120
        sum_5 = sum_5  .+  factor_p[i] .* PermutedDimsArray(A, P[i])
    end     
end    


P = perms([1,2,3,4])
factor_p = [0.1:0.1:2.4;]
n = 30
A = rand(n, n, n, n)

perm_add_4(n)
@btime perm_add_4(n)


P = perms([1,2,3,4,5])
factor_p = [0.1:0.1:12.0;]
n = 20
A = rand(n, n, n, n, n)
@btime perm_add_5(n)

But, this style is easy to adapt with test3(A) with PermutedDimsArray, harder for test4 ~ test6.

From TensorOperations.jl is often the fastest way to do permutedims on larger arrays. It has a smarter cache-friendly blocking algorithm than the one in Base. But how much this matters of course depends on size & permutation., is there any simple method to adapt with TensorOperations.jl?

1 Like

I tried to apply Tullio/tensor for more permutations similar to test4/6. I tested for 6 dimensional array. @tensor works. I did not figure out any elegant solution. I used Python to generate a long Julia code. But, @tullio leads to

ERROR: LoadError: BoundsError: attempt to access 721-element Vector{LoopVectorization.ArrayReferenceMeta} at index [0]

Is there any solution/more elegant approach? Thank you very much



test10(A) = @tullio  B[i,j,k,l,m,n] := 0.1* A[i, j, k, l, m, n] +
0.2* A[i, j, k, l, n, m] +
0.30000000000000004* A[i, j, k, m, l, n] +
0.4* A[i, j, k, n, l, m] +
0.5* A[i, j, k, m, n, l] +
0.6000000000000001* A[i, j, k, n, m, l] +
0.7000000000000001* A[i, j, l, k, m, n] +
0.8* A[i, j, l, k, n, m] +
0.9* A[i, j, m, k, l, n] +
1.0* A[i, j, n, k, l, m] +
1.1* A[i, j, m, k, n, l] +
1.2000000000000002* A[i, j, n, k, m, l] +
1.3* A[i, j, l, m, k, n] +
1.4000000000000001* A[i, j, l, n, k, m] +
1.5* A[i, j, m, l, k, n] +
1.6* A[i, j, n, l, k, m] +
1.7000000000000002* A[i, j, m, n, k, l] +
1.8* A[i, j, n, m, k, l] +
1.9000000000000001* A[i, j, l, m, n, k] +
2.0* A[i, j, l, n, m, k] +
2.1* A[i, j, m, l, n, k] +
2.2* A[i, j, n, l, m, k] +
2.3000000000000003* A[i, j, m, n, l, k] +
2.4000000000000004* A[i, j, n, m, l, k] +
2.5* A[i, k, j, l, m, n] +
2.6* A[i, k, j, l, n, m] +
2.7* A[i, k, j, m, l, n] +
2.8000000000000003* A[i, k, j, n, l, m] +
2.9000000000000004* A[i, k, j, m, n, l] +
3.0* A[i, k, j, n, m, l] +
3.1* A[i, l, j, k, m, n] +
3.2* A[i, l, j, k, n, m] +
3.3000000000000003* A[i, m, j, k, l, n] +
3.4000000000000004* A[i, n, j, k, l, m] +
3.5* A[i, m, j, k, n, l] +
3.6* A[i, n, j, k, m, l] +
3.7* A[i, l, j, m, k, n] +
3.8000000000000003* A[i, l, j, n, k, m] +
3.9000000000000004* A[i, m, j, l, k, n] +
4.0* A[i, n, j, l, k, m] +
4.1000000000000005* A[i, m, j, n, k, l] +
4.2* A[i, n, j, m, k, l] +
4.3* A[i, l, j, m, n, k] +
4.4* A[i, l, j, n, m, k] +
4.5* A[i, m, j, l, n, k] +
4.6000000000000005* A[i, n, j, l, m, k] +
4.7* A[i, m, j, n, l, k] +
4.800000000000001* A[i, n, j, m, l, k] +
4.9* A[i, k, l, j, m, n] +
5.0* A[i, k, l, j, n, m] +
5.1000000000000005* A[i, k, m, j, l, n] +
5.2* A[i, k, n, j, l, m] +
5.300000000000001* A[i, k, m, j, n, l] +
5.4* A[i, k, n, j, m, l] +
5.5* A[i, l, k, j, m, n] +
5.6000000000000005* A[i, l, k, j, n, m] +
5.7* A[i, m, k, j, l, n] +
5.800000000000001* A[i, n, k, j, l, m] +
5.9* A[i, m, k, j, n, l] +
6.0* A[i, n, k, j, m, l] +
6.1000000000000005* A[i, l, m, j, k, n] +
6.2* A[i, l, n, j, k, m] +
6.300000000000001* A[i, m, l, j, k, n] +
6.4* A[i, n, l, j, k, m] +
6.5* A[i, m, n, j, k, l] +
6.6000000000000005* A[i, n, m, j, k, l] +
6.7* A[i, l, m, j, n, k] +
6.800000000000001* A[i, l, n, j, m, k] +
6.9* A[i, m, l, j, n, k] +
7.0* A[i, n, l, j, m, k] +
7.1000000000000005* A[i, m, n, j, l, k] +
7.2* A[i, n, m, j, l, k] +
7.300000000000001* A[i, k, l, m, j, n] +
7.4* A[i, k, l, n, j, m] +
7.5* A[i, k, m, l, j, n] +
7.6000000000000005* A[i, k, n, l, j, m] +
7.7* A[i, k, m, n, j, l] +
7.800000000000001* A[i, k, n, m, j, l] +
7.9* A[i, l, k, m, j, n] +
8.0* A[i, l, k, n, j, m] +
8.1* A[i, m, k, l, j, n] +
8.200000000000001* A[i, n, k, l, j, m] +
8.3* A[i, m, k, n, j, l] +
8.4* A[i, n, k, m, j, l] +
8.5* A[i, l, m, k, j, n] +
8.6* A[i, l, n, k, j, m] +
8.700000000000001* A[i, m, l, k, j, n] +
8.8* A[i, n, l, k, j, m] +
8.9* A[i, m, n, k, j, l] +
9.0* A[i, n, m, k, j, l] +
9.1* A[i, l, m, n, j, k] +
9.200000000000001* A[i, l, n, m, j, k] +
9.3* A[i, m, l, n, j, k] +
9.4* A[i, n, l, m, j, k] +
9.5* A[i, m, n, l, j, k] +
9.600000000000001* A[i, n, m, l, j, k] +
9.700000000000001* A[i, k, l, m, n, j] +
9.8* A[i, k, l, n, m, j] +
9.9* A[i, k, m, l, n, j] +
10.0* A[i, k, n, l, m, j] +
10.100000000000001* A[i, k, m, n, l, j] +
10.200000000000001* A[i, k, n, m, l, j] +
10.3* A[i, l, k, m, n, j] +
10.4* A[i, l, k, n, m, j] +
10.5* A[i, m, k, l, n, j] +
10.600000000000001* A[i, n, k, l, m, j] +
10.700000000000001* A[i, m, k, n, l, j] +
10.8* A[i, n, k, m, l, j] +
10.9* A[i, l, m, k, n, j] +
11.0* A[i, l, n, k, m, j] +
11.100000000000001* A[i, m, l, k, n, j] +
11.200000000000001* A[i, n, l, k, m, j] +
11.3* A[i, m, n, k, l, j] +
11.4* A[i, n, m, k, l, j] +
11.5* A[i, l, m, n, k, j] +
11.600000000000001* A[i, l, n, m, k, j] +
11.700000000000001* A[i, m, l, n, k, j] +
11.8* A[i, n, l, m, k, j] +
11.9* A[i, m, n, l, k, j] +
12.0* A[i, n, m, l, k, j] +
12.100000000000001* A[j, i, k, l, m, n] +
12.200000000000001* A[j, i, k, l, n, m] +
12.3* A[j, i, k, m, l, n] +
12.4* A[j, i, k, n, l, m] +
12.5* A[j, i, k, m, n, l] +
12.600000000000001* A[j, i, k, n, m, l] +
12.700000000000001* A[j, i, l, k, m, n] +
12.8* A[j, i, l, k, n, m] +
12.9* A[j, i, m, k, l, n] +
13.0* A[j, i, n, k, l, m] +
13.100000000000001* A[j, i, m, k, n, l] +
13.200000000000001* A[j, i, n, k, m, l] +
13.3* A[j, i, l, m, k, n] +
13.4* A[j, i, l, n, k, m] +
13.5* A[j, i, m, l, k, n] +
13.600000000000001* A[j, i, n, l, k, m] +
13.700000000000001* A[j, i, m, n, k, l] +
13.8* A[j, i, n, m, k, l] +
13.9* A[j, i, l, m, n, k] +
14.0* A[j, i, l, n, m, k] +
14.100000000000001* A[j, i, m, l, n, k] +
14.200000000000001* A[j, i, n, l, m, k] +
14.3* A[j, i, m, n, l, k] +
14.4* A[j, i, n, m, l, k] +
14.5* A[k, i, j, l, m, n] +
14.600000000000001* A[k, i, j, l, n, m] +
14.700000000000001* A[k, i, j, m, l, n] +
14.8* A[k, i, j, n, l, m] +
14.9* A[k, i, j, m, n, l] +
15.0* A[k, i, j, n, m, l] +
15.100000000000001* A[l, i, j, k, m, n] +
15.200000000000001* A[l, i, j, k, n, m] +
15.3* A[m, i, j, k, l, n] +
15.4* A[n, i, j, k, l, m] +
15.5* A[m, i, j, k, n, l] +
15.600000000000001* A[n, i, j, k, m, l] +
15.700000000000001* A[l, i, j, m, k, n] +
15.8* A[l, i, j, n, k, m] +
15.9* A[m, i, j, l, k, n] +
16.0* A[n, i, j, l, k, m] +
16.1* A[m, i, j, n, k, l] +
16.2* A[n, i, j, m, k, l] +
16.3* A[l, i, j, m, n, k] +
16.400000000000002* A[l, i, j, n, m, k] +
16.5* A[m, i, j, l, n, k] +
16.6* A[n, i, j, l, m, k] +
16.7* A[m, i, j, n, l, k] +
16.8* A[n, i, j, m, l, k] +
16.900000000000002* A[k, i, l, j, m, n] +
17.0* A[k, i, l, j, n, m] +
17.1* A[k, i, m, j, l, n] +
17.2* A[k, i, n, j, l, m] +
17.3* A[k, i, m, j, n, l] +
17.400000000000002* A[k, i, n, j, m, l] +
17.5* A[l, i, k, j, m, n] +
17.6* A[l, i, k, j, n, m] +
17.7* A[m, i, k, j, l, n] +
17.8* A[n, i, k, j, l, m] +
17.900000000000002* A[m, i, k, j, n, l] +
18.0* A[n, i, k, j, m, l] +
18.1* A[l, i, m, j, k, n] +
18.2* A[l, i, n, j, k, m] +
18.3* A[m, i, l, j, k, n] +
18.400000000000002* A[n, i, l, j, k, m] +
18.5* A[m, i, n, j, k, l] +
18.6* A[n, i, m, j, k, l] +
18.7* A[l, i, m, j, n, k] +
18.8* A[l, i, n, j, m, k] +
18.900000000000002* A[m, i, l, j, n, k] +
19.0* A[n, i, l, j, m, k] +
19.1* A[m, i, n, j, l, k] +
19.200000000000003* A[n, i, m, j, l, k] +
19.3* A[k, i, l, m, j, n] +
19.400000000000002* A[k, i, l, n, j, m] +
19.5* A[k, i, m, l, j, n] +
19.6* A[k, i, n, l, j, m] +
19.700000000000003* A[k, i, m, n, j, l] +
19.8* A[k, i, n, m, j, l] +
19.900000000000002* A[l, i, k, m, j, n] +
20.0* A[l, i, k, n, j, m] +
20.1* A[m, i, k, l, j, n] +
20.200000000000003* A[n, i, k, l, j, m] +
20.3* A[m, i, k, n, j, l] +
20.400000000000002* A[n, i, k, m, j, l] +
20.5* A[l, i, m, k, j, n] +
20.6* A[l, i, n, k, j, m] +
20.700000000000003* A[m, i, l, k, j, n] +
20.8* A[n, i, l, k, j, m] +
20.900000000000002* A[m, i, n, k, j, l] +
21.0* A[n, i, m, k, j, l] +
21.1* A[l, i, m, n, j, k] +
21.200000000000003* A[l, i, n, m, j, k] +
21.3* A[m, i, l, n, j, k] +
21.400000000000002* A[n, i, l, m, j, k] +
21.5* A[m, i, n, l, j, k] +
21.6* A[n, i, m, l, j, k] +
21.700000000000003* A[k, i, l, m, n, j] +
21.8* A[k, i, l, n, m, j] +
21.900000000000002* A[k, i, m, l, n, j] +
22.0* A[k, i, n, l, m, j] +
22.1* A[k, i, m, n, l, j] +
22.200000000000003* A[k, i, n, m, l, j] +
22.3* A[l, i, k, m, n, j] +
22.400000000000002* A[l, i, k, n, m, j] +
22.5* A[m, i, k, l, n, j] +
22.6* A[n, i, k, l, m, j] +
22.700000000000003* A[m, i, k, n, l, j] +
22.8* A[n, i, k, m, l, j] +
22.900000000000002* A[l, i, m, k, n, j] +
23.0* A[l, i, n, k, m, j] +
23.1* A[m, i, l, k, n, j] +
23.200000000000003* A[n, i, l, k, m, j] +
23.3* A[m, i, n, k, l, j] +
23.400000000000002* A[n, i, m, k, l, j] +
23.5* A[l, i, m, n, k, j] +
23.6* A[l, i, n, m, k, j] +
23.700000000000003* A[m, i, l, n, k, j] +
23.8* A[n, i, l, m, k, j] +
23.900000000000002* A[m, i, n, l, k, j] +
24.0* A[n, i, m, l, k, j] +
24.1* A[j, k, i, l, m, n] +
24.200000000000003* A[j, k, i, l, n, m] +
24.3* A[j, k, i, m, l, n] +
24.400000000000002* A[j, k, i, n, l, m] +
24.5* A[j, k, i, m, n, l] +
24.6* A[j, k, i, n, m, l] +
24.700000000000003* A[j, l, i, k, m, n] +
24.8* A[j, l, i, k, n, m] +
24.900000000000002* A[j, m, i, k, l, n] +
25.0* A[j, n, i, k, l, m] +
25.1* A[j, m, i, k, n, l] +
25.200000000000003* A[j, n, i, k, m, l] +
25.3* A[j, l, i, m, k, n] +
25.400000000000002* A[j, l, i, n, k, m] +
25.5* A[j, m, i, l, k, n] +
25.6* A[j, n, i, l, k, m] +
25.700000000000003* A[j, m, i, n, k, l] +
25.8* A[j, n, i, m, k, l] +
25.900000000000002* A[j, l, i, m, n, k] +
26.0* A[j, l, i, n, m, k] +
26.1* A[j, m, i, l, n, k] +
26.200000000000003* A[j, n, i, l, m, k] +
26.3* A[j, m, i, n, l, k] +
26.400000000000002* A[j, n, i, m, l, k] +
26.5* A[k, j, i, l, m, n] +
26.6* A[k, j, i, l, n, m] +
26.700000000000003* A[k, j, i, m, l, n] +
26.8* A[k, j, i, n, l, m] +
26.900000000000002* A[k, j, i, m, n, l] +
27.0* A[k, j, i, n, m, l] +
27.1* A[l, j, i, k, m, n] +
27.200000000000003* A[l, j, i, k, n, m] +
27.3* A[m, j, i, k, l, n] +
27.400000000000002* A[n, j, i, k, l, m] +
27.5* A[m, j, i, k, n, l] +
27.6* A[n, j, i, k, m, l] +
27.700000000000003* A[l, j, i, m, k, n] +
27.8* A[l, j, i, n, k, m] +
27.900000000000002* A[m, j, i, l, k, n] +
28.0* A[n, j, i, l, k, m] +
28.1* A[m, j, i, n, k, l] +
28.200000000000003* A[n, j, i, m, k, l] +
28.3* A[l, j, i, m, n, k] +
28.400000000000002* A[l, j, i, n, m, k] +
28.5* A[m, j, i, l, n, k] +
28.6* A[n, j, i, l, m, k] +
28.700000000000003* A[m, j, i, n, l, k] +
28.8* A[n, j, i, m, l, k] +
28.900000000000002* A[k, l, i, j, m, n] +
29.0* A[k, l, i, j, n, m] +
29.1* A[k, m, i, j, l, n] +
29.200000000000003* A[k, n, i, j, l, m] +
29.3* A[k, m, i, j, n, l] +
29.400000000000002* A[k, n, i, j, m, l] +
29.5* A[l, k, i, j, m, n] +
29.6* A[l, k, i, j, n, m] +
29.700000000000003* A[m, k, i, j, l, n] +
29.8* A[n, k, i, j, l, m] +
29.900000000000002* A[m, k, i, j, n, l] +
30.0* A[n, k, i, j, m, l] +
30.1* A[l, m, i, j, k, n] +
30.200000000000003* A[l, n, i, j, k, m] +
30.3* A[m, l, i, j, k, n] +
30.400000000000002* A[n, l, i, j, k, m] +
30.5* A[m, n, i, j, k, l] +
30.6* A[n, m, i, j, k, l] +
30.700000000000003* A[l, m, i, j, n, k] +
30.8* A[l, n, i, j, m, k] +
30.900000000000002* A[m, l, i, j, n, k] +
31.0* A[n, l, i, j, m, k] +
31.1* A[m, n, i, j, l, k] +
31.200000000000003* A[n, m, i, j, l, k] +
31.3* A[k, l, i, m, j, n] +
31.400000000000002* A[k, l, i, n, j, m] +
31.5* A[k, m, i, l, j, n] +
31.6* A[k, n, i, l, j, m] +
31.700000000000003* A[k, m, i, n, j, l] +
31.8* A[k, n, i, m, j, l] +
31.900000000000002* A[l, k, i, m, j, n] +
32.0* A[l, k, i, n, j, m] +
32.1* A[m, k, i, l, j, n] +
32.2* A[n, k, i, l, j, m] +
32.300000000000004* A[m, k, i, n, j, l] +
32.4* A[n, k, i, m, j, l] +
32.5* A[l, m, i, k, j, n] +
32.6* A[l, n, i, k, j, m] +
32.7* A[m, l, i, k, j, n] +
32.800000000000004* A[n, l, i, k, j, m] +
32.9* A[m, n, i, k, j, l] +
33.0* A[n, m, i, k, j, l] +
33.1* A[l, m, i, n, j, k] +
33.2* A[l, n, i, m, j, k] +
33.300000000000004* A[m, l, i, n, j, k] +
33.4* A[n, l, i, m, j, k] +
33.5* A[m, n, i, l, j, k] +
33.6* A[n, m, i, l, j, k] +
33.7* A[k, l, i, m, n, j] +
33.800000000000004* A[k, l, i, n, m, j] +
33.9* A[k, m, i, l, n, j] +
34.0* A[k, n, i, l, m, j] +
34.1* A[k, m, i, n, l, j] +
34.2* A[k, n, i, m, l, j] +
34.300000000000004* A[l, k, i, m, n, j] +
34.4* A[l, k, i, n, m, j] +
34.5* A[m, k, i, l, n, j] +
34.6* A[n, k, i, l, m, j] +
34.7* A[m, k, i, n, l, j] +
34.800000000000004* A[n, k, i, m, l, j] +
34.9* A[l, m, i, k, n, j] +
35.0* A[l, n, i, k, m, j] +
35.1* A[m, l, i, k, n, j] +
35.2* A[n, l, i, k, m, j] +
35.300000000000004* A[m, n, i, k, l, j] +
35.4* A[n, m, i, k, l, j] +
35.5* A[l, m, i, n, k, j] +
35.6* A[l, n, i, m, k, j] +
35.7* A[m, l, i, n, k, j] +
35.800000000000004* A[n, l, i, m, k, j] +
35.9* A[m, n, i, l, k, j] +
36.0* A[n, m, i, l, k, j] +
36.1* A[j, k, l, i, m, n] +
36.2* A[j, k, l, i, n, m] +
36.300000000000004* A[j, k, m, i, l, n] +
36.4* A[j, k, n, i, l, m] +
36.5* A[j, k, m, i, n, l] +
36.6* A[j, k, n, i, m, l] +
36.7* A[j, l, k, i, m, n] +
36.800000000000004* A[j, l, k, i, n, m] +
36.9* A[j, m, k, i, l, n] +
37.0* A[j, n, k, i, l, m] +
37.1* A[j, m, k, i, n, l] +
37.2* A[j, n, k, i, m, l] +
37.300000000000004* A[j, l, m, i, k, n] +
37.4* A[j, l, n, i, k, m] +
37.5* A[j, m, l, i, k, n] +
37.6* A[j, n, l, i, k, m] +
37.7* A[j, m, n, i, k, l] +
37.800000000000004* A[j, n, m, i, k, l] +
37.9* A[j, l, m, i, n, k] +
38.0* A[j, l, n, i, m, k] +
38.1* A[j, m, l, i, n, k] +
38.2* A[j, n, l, i, m, k] +
38.300000000000004* A[j, m, n, i, l, k] +
38.400000000000006* A[j, n, m, i, l, k] +
38.5* A[k, j, l, i, m, n] +
38.6* A[k, j, l, i, n, m] +
38.7* A[k, j, m, i, l, n] +
38.800000000000004* A[k, j, n, i, l, m] +
38.900000000000006* A[k, j, m, i, n, l] +
39.0* A[k, j, n, i, m, l] +
39.1* A[l, j, k, i, m, n] +
39.2* A[l, j, k, i, n, m] +
39.300000000000004* A[m, j, k, i, l, n] +
39.400000000000006* A[n, j, k, i, l, m] +
39.5* A[m, j, k, i, n, l] +
39.6* A[n, j, k, i, m, l] +
39.7* A[l, j, m, i, k, n] +
39.800000000000004* A[l, j, n, i, k, m] +
39.900000000000006* A[m, j, l, i, k, n] +
40.0* A[n, j, l, i, k, m] +
40.1* A[m, j, n, i, k, l] +
40.2* A[n, j, m, i, k, l] +
40.300000000000004* A[l, j, m, i, n, k] +
40.400000000000006* A[l, j, n, i, m, k] +
40.5* A[m, j, l, i, n, k] +
40.6* A[n, j, l, i, m, k] +
40.7* A[m, j, n, i, l, k] +
40.800000000000004* A[n, j, m, i, l, k] +
40.900000000000006* A[k, l, j, i, m, n] +
41.0* A[k, l, j, i, n, m] +
41.1* A[k, m, j, i, l, n] +
41.2* A[k, n, j, i, l, m] +
41.300000000000004* A[k, m, j, i, n, l] +
41.400000000000006* A[k, n, j, i, m, l] +
41.5* A[l, k, j, i, m, n] +
41.6* A[l, k, j, i, n, m] +
41.7* A[m, k, j, i, l, n] +
41.800000000000004* A[n, k, j, i, l, m] +
41.900000000000006* A[m, k, j, i, n, l] +
42.0* A[n, k, j, i, m, l] +
42.1* A[l, m, j, i, k, n] +
42.2* A[l, n, j, i, k, m] +
42.300000000000004* A[m, l, j, i, k, n] +
42.400000000000006* A[n, l, j, i, k, m] +
42.5* A[m, n, j, i, k, l] +
42.6* A[n, m, j, i, k, l] +
42.7* A[l, m, j, i, n, k] +
42.800000000000004* A[l, n, j, i, m, k] +
42.900000000000006* A[m, l, j, i, n, k] +
43.0* A[n, l, j, i, m, k] +
43.1* A[m, n, j, i, l, k] +
43.2* A[n, m, j, i, l, k] +
43.300000000000004* A[k, l, m, i, j, n] +
43.400000000000006* A[k, l, n, i, j, m] +
43.5* A[k, m, l, i, j, n] +
43.6* A[k, n, l, i, j, m] +
43.7* A[k, m, n, i, j, l] +
43.800000000000004* A[k, n, m, i, j, l] +
43.900000000000006* A[l, k, m, i, j, n] +
44.0* A[l, k, n, i, j, m] +
44.1* A[m, k, l, i, j, n] +
44.2* A[n, k, l, i, j, m] +
44.300000000000004* A[m, k, n, i, j, l] +
44.400000000000006* A[n, k, m, i, j, l] +
44.5* A[l, m, k, i, j, n] +
44.6* A[l, n, k, i, j, m] +
44.7* A[m, l, k, i, j, n] +
44.800000000000004* A[n, l, k, i, j, m] +
44.900000000000006* A[m, n, k, i, j, l] +
45.0* A[n, m, k, i, j, l] +
45.1* A[l, m, n, i, j, k] +
45.2* A[l, n, m, i, j, k] +
45.300000000000004* A[m, l, n, i, j, k] +
45.400000000000006* A[n, l, m, i, j, k] +
45.5* A[m, n, l, i, j, k] +
45.6* A[n, m, l, i, j, k] +
45.7* A[k, l, m, i, n, j] +
45.800000000000004* A[k, l, n, i, m, j] +
45.900000000000006* A[k, m, l, i, n, j] +
46.0* A[k, n, l, i, m, j] +
46.1* A[k, m, n, i, l, j] +
46.2* A[k, n, m, i, l, j] +
46.300000000000004* A[l, k, m, i, n, j] +
46.400000000000006* A[l, k, n, i, m, j] +
46.5* A[m, k, l, i, n, j] +
46.6* A[n, k, l, i, m, j] +
46.7* A[m, k, n, i, l, j] +
46.800000000000004* A[n, k, m, i, l, j] +
46.900000000000006* A[l, m, k, i, n, j] +
47.0* A[l, n, k, i, m, j] +
47.1* A[m, l, k, i, n, j] +
47.2* A[n, l, k, i, m, j] +
47.300000000000004* A[m, n, k, i, l, j] +
47.400000000000006* A[n, m, k, i, l, j] +
47.5* A[l, m, n, i, k, j] +
47.6* A[l, n, m, i, k, j] +
47.7* A[m, l, n, i, k, j] +
47.800000000000004* A[n, l, m, i, k, j] +
47.900000000000006* A[m, n, l, i, k, j] +
48.0* A[n, m, l, i, k, j] +
48.1* A[j, k, l, m, i, n] +
48.2* A[j, k, l, n, i, m] +
48.300000000000004* A[j, k, m, l, i, n] +
48.400000000000006* A[j, k, n, l, i, m] +
48.5* A[j, k, m, n, i, l] +
48.6* A[j, k, n, m, i, l] +
48.7* A[j, l, k, m, i, n] +
48.800000000000004* A[j, l, k, n, i, m] +
48.900000000000006* A[j, m, k, l, i, n] +
49.0* A[j, n, k, l, i, m] +
49.1* A[j, m, k, n, i, l] +
49.2* A[j, n, k, m, i, l] +
49.300000000000004* A[j, l, m, k, i, n] +
49.400000000000006* A[j, l, n, k, i, m] +
49.5* A[j, m, l, k, i, n] +
49.6* A[j, n, l, k, i, m] +
49.7* A[j, m, n, k, i, l] +
49.800000000000004* A[j, n, m, k, i, l] +
49.900000000000006* A[j, l, m, n, i, k] +
50.0* A[j, l, n, m, i, k] +
50.1* A[j, m, l, n, i, k] +
50.2* A[j, n, l, m, i, k] +
50.300000000000004* A[j, m, n, l, i, k] +
50.400000000000006* A[j, n, m, l, i, k] +
50.5* A[k, j, l, m, i, n] +
50.6* A[k, j, l, n, i, m] +
50.7* A[k, j, m, l, i, n] +
50.800000000000004* A[k, j, n, l, i, m] +
50.900000000000006* A[k, j, m, n, i, l] +
51.0* A[k, j, n, m, i, l] +
51.1* A[l, j, k, m, i, n] +
51.2* A[l, j, k, n, i, m] +
51.300000000000004* A[m, j, k, l, i, n] +
51.400000000000006* A[n, j, k, l, i, m] +
51.5* A[m, j, k, n, i, l] +
51.6* A[n, j, k, m, i, l] +
51.7* A[l, j, m, k, i, n] +
51.800000000000004* A[l, j, n, k, i, m] +
51.900000000000006* A[m, j, l, k, i, n] +
52.0* A[n, j, l, k, i, m] +
52.1* A[m, j, n, k, i, l] +
52.2* A[n, j, m, k, i, l] +
52.300000000000004* A[l, j, m, n, i, k] +
52.400000000000006* A[l, j, n, m, i, k] +
52.5* A[m, j, l, n, i, k] +
52.6* A[n, j, l, m, i, k] +
52.7* A[m, j, n, l, i, k] +
52.800000000000004* A[n, j, m, l, i, k] +
52.900000000000006* A[k, l, j, m, i, n] +
53.0* A[k, l, j, n, i, m] +
53.1* A[k, m, j, l, i, n] +
53.2* A[k, n, j, l, i, m] +
53.300000000000004* A[k, m, j, n, i, l] +
53.400000000000006* A[k, n, j, m, i, l] +
53.5* A[l, k, j, m, i, n] +
53.6* A[l, k, j, n, i, m] +
53.7* A[m, k, j, l, i, n] +
53.800000000000004* A[n, k, j, l, i, m] +
53.900000000000006* A[m, k, j, n, i, l] +
54.0* A[n, k, j, m, i, l] +
54.1* A[l, m, j, k, i, n] +
54.2* A[l, n, j, k, i, m] +
54.300000000000004* A[m, l, j, k, i, n] +
54.400000000000006* A[n, l, j, k, i, m] +
54.5* A[m, n, j, k, i, l] +
54.6* A[n, m, j, k, i, l] +
54.7* A[l, m, j, n, i, k] +
54.800000000000004* A[l, n, j, m, i, k] +
54.900000000000006* A[m, l, j, n, i, k] +
55.0* A[n, l, j, m, i, k] +
55.1* A[m, n, j, l, i, k] +
55.2* A[n, m, j, l, i, k] +
55.300000000000004* A[k, l, m, j, i, n] +
55.400000000000006* A[k, l, n, j, i, m] +
55.5* A[k, m, l, j, i, n] +
55.6* A[k, n, l, j, i, m] +
55.7* A[k, m, n, j, i, l] +
55.800000000000004* A[k, n, m, j, i, l] +
55.900000000000006* A[l, k, m, j, i, n] +
56.0* A[l, k, n, j, i, m] +
56.1* A[m, k, l, j, i, n] +
56.2* A[n, k, l, j, i, m] +
56.300000000000004* A[m, k, n, j, i, l] +
56.400000000000006* A[n, k, m, j, i, l] +
56.5* A[l, m, k, j, i, n] +
56.6* A[l, n, k, j, i, m] +
56.7* A[m, l, k, j, i, n] +
56.800000000000004* A[n, l, k, j, i, m] +
56.900000000000006* A[m, n, k, j, i, l] +
57.0* A[n, m, k, j, i, l] +
57.1* A[l, m, n, j, i, k] +
57.2* A[l, n, m, j, i, k] +
57.300000000000004* A[m, l, n, j, i, k] +
57.400000000000006* A[n, l, m, j, i, k] +
57.5* A[m, n, l, j, i, k] +
57.6* A[n, m, l, j, i, k] +
57.7* A[k, l, m, n, i, j] +
57.800000000000004* A[k, l, n, m, i, j] +
57.900000000000006* A[k, m, l, n, i, j] +
58.0* A[k, n, l, m, i, j] +
58.1* A[k, m, n, l, i, j] +
58.2* A[k, n, m, l, i, j] +
58.300000000000004* A[l, k, m, n, i, j] +
58.400000000000006* A[l, k, n, m, i, j] +
58.5* A[m, k, l, n, i, j] +
58.6* A[n, k, l, m, i, j] +
58.7* A[m, k, n, l, i, j] +
58.800000000000004* A[n, k, m, l, i, j] +
58.900000000000006* A[l, m, k, n, i, j] +
59.0* A[l, n, k, m, i, j] +
59.1* A[m, l, k, n, i, j] +
59.2* A[n, l, k, m, i, j] +
59.300000000000004* A[m, n, k, l, i, j] +
59.400000000000006* A[n, m, k, l, i, j] +
59.5* A[l, m, n, k, i, j] +
59.6* A[l, n, m, k, i, j] +
59.7* A[m, l, n, k, i, j] +
59.800000000000004* A[n, l, m, k, i, j] +
59.900000000000006* A[m, n, l, k, i, j] +
60.0* A[n, m, l, k, i, j] +
60.1* A[j, k, l, m, n, i] +
60.2* A[j, k, l, n, m, i] +
60.300000000000004* A[j, k, m, l, n, i] +
60.400000000000006* A[j, k, n, l, m, i] +
60.5* A[j, k, m, n, l, i] +
60.6* A[j, k, n, m, l, i] +
60.7* A[j, l, k, m, n, i] +
60.800000000000004* A[j, l, k, n, m, i] +
60.900000000000006* A[j, m, k, l, n, i] +
61.0* A[j, n, k, l, m, i] +
61.1* A[j, m, k, n, l, i] +
61.2* A[j, n, k, m, l, i] +
61.300000000000004* A[j, l, m, k, n, i] +
61.400000000000006* A[j, l, n, k, m, i] +
61.5* A[j, m, l, k, n, i] +
61.6* A[j, n, l, k, m, i] +
61.7* A[j, m, n, k, l, i] +
61.800000000000004* A[j, n, m, k, l, i] +
61.900000000000006* A[j, l, m, n, k, i] +
62.0* A[j, l, n, m, k, i] +
62.1* A[j, m, l, n, k, i] +
62.2* A[j, n, l, m, k, i] +
62.300000000000004* A[j, m, n, l, k, i] +
62.400000000000006* A[j, n, m, l, k, i] +
62.5* A[k, j, l, m, n, i] +
62.6* A[k, j, l, n, m, i] +
62.7* A[k, j, m, l, n, i] +
62.800000000000004* A[k, j, n, l, m, i] +
62.900000000000006* A[k, j, m, n, l, i] +
63.0* A[k, j, n, m, l, i] +
63.1* A[l, j, k, m, n, i] +
63.2* A[l, j, k, n, m, i] +
63.300000000000004* A[m, j, k, l, n, i] +
63.400000000000006* A[n, j, k, l, m, i] +
63.5* A[m, j, k, n, l, i] +
63.6* A[n, j, k, m, l, i] +
63.7* A[l, j, m, k, n, i] +
63.800000000000004* A[l, j, n, k, m, i] +
63.900000000000006* A[m, j, l, k, n, i] +
64.0* A[n, j, l, k, m, i] +
64.10000000000001* A[m, j, n, k, l, i] +
64.2* A[n, j, m, k, l, i] +
64.3* A[l, j, m, n, k, i] +
64.4* A[l, j, n, m, k, i] +
64.5* A[m, j, l, n, k, i] +
64.60000000000001* A[n, j, l, m, k, i] +
64.7* A[m, j, n, l, k, i] +
64.8* A[n, j, m, l, k, i] +
64.9* A[k, l, j, m, n, i] +
65.0* A[k, l, j, n, m, i] +
65.10000000000001* A[k, m, j, l, n, i] +
65.2* A[k, n, j, l, m, i] +
65.3* A[k, m, j, n, l, i] +
65.4* A[k, n, j, m, l, i] +
65.5* A[l, k, j, m, n, i] +
65.60000000000001* A[l, k, j, n, m, i] +
65.7* A[m, k, j, l, n, i] +
65.8* A[n, k, j, l, m, i] +
65.9* A[m, k, j, n, l, i] +
66.0* A[n, k, j, m, l, i] +
66.10000000000001* A[l, m, j, k, n, i] +
66.2* A[l, n, j, k, m, i] +
66.3* A[m, l, j, k, n, i] +
66.4* A[n, l, j, k, m, i] +
66.5* A[m, n, j, k, l, i] +
66.60000000000001* A[n, m, j, k, l, i] +
66.7* A[l, m, j, n, k, i] +
66.8* A[l, n, j, m, k, i] +
66.9* A[m, l, j, n, k, i] +
67.0* A[n, l, j, m, k, i] +
67.10000000000001* A[m, n, j, l, k, i] +
67.2* A[n, m, j, l, k, i] +
67.3* A[k, l, m, j, n, i] +
67.4* A[k, l, n, j, m, i] +
67.5* A[k, m, l, j, n, i] +
67.60000000000001* A[k, n, l, j, m, i] +
67.7* A[k, m, n, j, l, i] +
67.8* A[k, n, m, j, l, i] +
67.9* A[l, k, m, j, n, i] +
68.0* A[l, k, n, j, m, i] +
68.10000000000001* A[m, k, l, j, n, i] +
68.2* A[n, k, l, j, m, i] +
68.3* A[m, k, n, j, l, i] +
68.4* A[n, k, m, j, l, i] +
68.5* A[l, m, k, j, n, i] +
68.60000000000001* A[l, n, k, j, m, i] +
68.7* A[m, l, k, j, n, i] +
68.8* A[n, l, k, j, m, i] +
68.9* A[m, n, k, j, l, i] +
69.0* A[n, m, k, j, l, i] +
69.10000000000001* A[l, m, n, j, k, i] +
69.2* A[l, n, m, j, k, i] +
69.3* A[m, l, n, j, k, i] +
69.4* A[n, l, m, j, k, i] +
69.5* A[m, n, l, j, k, i] +
69.60000000000001* A[n, m, l, j, k, i] +
69.7* A[k, l, m, n, j, i] +
69.8* A[k, l, n, m, j, i] +
69.9* A[k, m, l, n, j, i] +
70.0* A[k, n, l, m, j, i] +
70.10000000000001* A[k, m, n, l, j, i] +
70.2* A[k, n, m, l, j, i] +
70.3* A[l, k, m, n, j, i] +
70.4* A[l, k, n, m, j, i] +
70.5* A[m, k, l, n, j, i] +
70.60000000000001* A[n, k, l, m, j, i] +
70.7* A[m, k, n, l, j, i] +
70.8* A[n, k, m, l, j, i] +
70.9* A[l, m, k, n, j, i] +
71.0* A[l, n, k, m, j, i] +
71.10000000000001* A[m, l, k, n, j, i] +
71.2* A[n, l, k, m, j, i] +
71.3* A[m, n, k, l, j, i] +
71.4* A[n, m, k, l, j, i] +
71.5* A[l, m, n, k, j, i] +
71.60000000000001* A[l, n, m, k, j, i] +
71.7* A[m, l, n, k, j, i] +
71.8* A[n, l, m, k, j, i] +
71.9* A[m, n, l, k, j, i] +
72.0* A[n, m, l, k, j, i]

@btime test10(A) setup=(n=10; A=rand(n,n,n,n,n,n))
#println( B[1,1,1,1,1,1]/A[1,1,1,1,1,1])

700 terms is a lot. Maybe you want something more like this:

julia> begin
         B = zero.(A)
         for (i,p) in enumerate(permutations(1:ndims(A)))
           B .+= i/10 .* PermutedDimsArray(A, Tuple(p))
         end
         B
       end
2×2×2×2 Array{Float64, 4}:
1 Like

Yes. That I tried on the #17th reply, which appears in test3(A) = 0.1 .* A .+ 0.2 .* PermutedDimsArray(A, (2,1,3,4)) in 14th reply. I would like to try the performance from @tullio and @tensor, but did not figure out an elegant way to do so. (@tullio crashed)

Oh right, sorry, that’s very similar.

Not sure exactly what the error is, but note that it’s not from Tullio, it’s from LoopVectorization. Tullio alone seems to run, but calling +(x...) with > 700 arguments is asking for problems – many things handling tuples stop being efficient at 32.

The whole thing seems a bit X-Y problem though. Why do you want to add all the permutations of a 6D-array? If you must, it’s linear, so you can just save the coefficients to disk and do it as one matrix-vector multiplication; at size n=3 that’s 10^5 times faster.

2 Likes

I see. After removed LoopVectorization, Tullio takes forever…

I thought about defining a bigger array, take 4-dimensional case for example, A[i,:,:,:,:] . For each i, it corresponds to the ith permutation. Thus, the problem can be converted into
A[i,:,:,:,:]* P[i] = B[:,:,:,:] , where P[i] is the array of factors 0.1, 0.2, … corresponds to each permutation, as kind of matrix-vector multiplication. (sum over i)

I tried Fortran and Python. The thing is,the matrix-vector multiplication is faster, but building the bigger array takes longer time. In total, the time is similar to the direct summation over permutation. In Julia, I met some issues. Unable to do so. Maybe there is some trick to make it faster. I really want to see 10^5 faster

Here is my Python code

import numpy as np
import time
import itertools as it
import opt_einsum as oe

ref_list = [0, 1, 2, 3]
p = it.permutations(ref_list)
transpose_list = tuple(p)


n_loop = 2
na = nb = nc = nd = 30

A = np.random.random((na,nb,nc,nd))

factor_list = [(i+1)*0.1 for i in range(24)]
time_total = 0.0

for n in range(n_loop):
    sum_A = np.zeros((na,nb,nc,nd))
    start_0 = time.time()
    for m, t in enumerate(transpose_list):
       if abs(factor_list[m]) < 1.e-3:
           continue 
       np.add(sum_A, factor_list[m]  * np.transpose(A, transpose_list[m] ), out = sum_A) 

    finish_0 = time.time()
    time_total += finish_0 - start_0

print('time for permu addition', time_total/n_loop) 


n_factor = 24
total_array = np.zeros((n_factor,na,nb,nc,nd))
factor_array = np.asarray(factor_list)


time_total = 0.0
for n in range(n_loop):
  start_0 = time.time()
  for m in range(n_factor):
    total_array[m,:,:,:,:] = np.transpose(A, transpose_list[m] )

  factor_array = np.asarray(factor_list)

  oe.contract('nijkl,n->ijkl', total_array, factor_array)
  finish_0 = time.time()    
  time_total += finish_0 - start_0

print('time for einsum',time_total/n_loop)

I got

time for permu addition 0.10860311985015869
time for einsum 0.14762508869171143

Here is my Julia code

using BenchmarkTools, Tullio, TensorOperations, Combinatorics, LoopVectorization

function perms(a) 
    B = collect(permutations(a))
    B
end

function einsum(n)
    @tullio C[i,j,k,l] := factor_p[n]  * A[n, i, j, k, l]  
end    


thresh = 0.0001
P = perms([1,2,3,4])
factor_p = [0.1:0.1:2.4;]
n = 30
A = rand(n, n, n, n)

n_factor = 24
factor_p = [0.1:0.1:2.4;]


total_A = zeros(n_factor, n, n, n, n)
for i = 1:n_factor
    @tullio total_A[i,j,k,l,m] = PermutedDimsArray(A, P[i])[j,k,l,m]
end    

@btime einsum(n_factor)

I got

ERROR: LoadError: "expected a 5-array A"
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Tullio/wAFFh/src/macro.jl:983 [inlined]
...

Sorry for being late to the party, but if you write it using views, broadcasting and PermutedDimsArray (or just permutedims), but put @strided from Strided.jl in front, you should get a decent speed up. Strided.jl is what speeds up the permutations in TensorOperations.jl, but can be used in itself using the @strided macro, which should combine well with most broadcasting expressions.

Furthermore, Strided.jl supports multithreading, so if you launch Julia with multiple threads, you could get some further speedup (although in the end this kind of operation is bandwidth limited rather than compute limited).

Thanks. I tried a bit.

using BenchmarkTools, Tullio, TensorOperations, Combinatorics, Strided

function perms(a)
B = collect(permutations(a))
B
end

function perm_add_4(n)
sum_4 = zeros(n, n, n, n)
for i = 1:24
sum_4 = sum_4 .+ factor_p[i] .* PermutedDimsArray(A, P[i])
end
end

function perm_add_4_stride(n)
sum_4 = zeros(n, n, n, n)
for i = 1:24
@strided sum_4 = sum_4 .+ factor_p[i] .* PermutedDimsArray(A, P[i])
end
end

P = perms([1,2,3,4])
factor_p = [0.1:0.1:2.4;]
n = 30
A = rand(n, n, n, n)

@btime perm_add_4(n)
@btime perm_add_4_stride(n)

I got

75.302 ms (315 allocations: 154.51 MiB)
81.919 ms (531 allocations: 154.53 MiB)

by julia 1.7.3
not much improvement :frowning:

Note that you have many allocations in both functions. The reason is that, in your loop, you are not storing the result in sum4 but allocating a new sum4 array every time, because you forgot a dot before the equal sign. Without this, broadcasting is used for the right hand side, but it is then materialized into a new array, which is then called sum4. So I guess you want

function perm_add_4(n)
sum_4 = zeros(n, n, n, n)
for i = 1:24
sum_4 .= sum_4 .+ factor_p[i] .* PermutedDimsArray(A, P[i])
end
end
function perm_add_4_stride(n)
sum_4 = zeros(n, n, n, n)
for i = 1:24
@strided sum_4 .= sum_4 .+ factor_p[i] .* PermutedDimsArray(A, P[i])
end
end

With these, I get, on my computer, using julia -t 4 to use 4 threads:

julia> @btime perm_add_4(n)
  33.952 ms (243 allocations: 6.19 MiB)

julia> @btime perm_add_4_stride(n)
  90.352 ms (2173 allocations: 154.67 MiB)

Apparently there is something wrong with how I deal with PermutedDimsArray, if you make it within the @strided macro. There are two solutions, namely to just use permutedims within the macro call, or to make the PermutedDimsArray before the @strided call. (I will of course also try to fix this).

function perm_add_4_stride_1(n)
sum_4 = zeros(n, n, n, n)
for i = 1:24
B = PermutedDimsArray(A, P[i])
@strided sum_4 .= sum_4 .+ factor_p[i] .* B
end
end
function perm_add_4_stride_2(n)
sum_4 = zeros(n, n, n, n)
for i = 1:24
@strided sum_4 .= sum_4 .+ factor_p[i] .* permutedims(A, P[i])
end
end

With these, I find

julia> @btime perm_add_4_stride_1(n)
  19.000 ms (2086 allocations: 6.38 MiB)

julia> @btime perm_add_4_stride_2(n)
  18.646 ms (2062 allocations: 6.38 MiB)

so much better.

Finally, you can speed it up further by ‘unrolling’ the loop, i.e. doing multiple additions at once, so that you do not have to run over the data of sum4 24 times:

function perm_add_4_stride_unrolled(n)
sum_4 = zeros(n, n, n, n)
for i = 1:4:24
@strided sum_4 .= sum_4 .+ factor_p[i] .* permutedims(A, P[i]) .+ factor_p[i+1] .* permutedims(A, P[i+1]) .+ factor_p[i+2] .* permutedims(A, P[i+2]) .+ factor_p[i+3] .* permutedims(A, P[i+3])
end
end

which gives

julia> @btime perm_add_4_stride_unrolled(n)
  10.937 ms (751 allocations: 6.28 MiB)
1 Like

In addition to other problems, you are accessing non-const global variables inside your functions. This is very bad for performance, and is the very first performance tip: Performance Tips · The Julia Language

A, P and factor_p should be input arguments to your functions.

1 Like

Thanks @DNF, these were indeed some extra comments I wanted to add (but forgot) at the end of my post. To further improve performance, now it is indeed down to the general performance tips in terms of avoiding globals etc.

1 Like