How to speed up permutedims for high dimensional tensors

I find permutedims is so slow, this is something can not be explained by complexity theory.
It is because the permutedims is not fully optimized? it occupies >90% of the computing time of my tensor program.

julia> a = randn(fill(2, 20)...);

julia> using BenchmarkTools

julia> @benchmark reshape(a, 1<<10, 1<<10) * reshape(a, 1<<10, 1<<10)
BenchmarkTools.Trial: 
  memory estimate:  8.00 MiB
  allocs estimate:  6
  --------------
  minimum time:     38.835 ms (0.00% GC)
  median time:      83.084 ms (0.00% GC)
  mean time:        89.822 ms (0.66% GC)
  maximum time:     160.574 ms (0.00% GC)
  --------------
  samples:          56
  evals/sample:     1

julia> using Random

julia> @benchmark permutedims(a, randperm(20))
BenchmarkTools.Trial: 
  memory estimate:  792.00 MiB
  allocs estimate:  25165893
  --------------
  minimum time:     1.380 s (1.78% GC)
  median time:      1.538 s (1.74% GC)
  mean time:        1.563 s (4.94% GC)
  maximum time:     1.797 s (12.87% GC)
  --------------
  samples:          4
  evals/sample:     1

Ah, the allocations suggests it might be related to tuple splatting,

julia> a = randn(fill(2, 14)...);

julia> @benchmark permutedims($a, $(randperm(14)))
BenchmarkTools.Trial: 
  memory estimate:  128.77 KiB
  allocs estimate:  8
  --------------
  minimum time:     77.574 μs (0.00% GC)
  median time:      83.264 μs (0.00% GC)
  mean time:        84.643 μs (1.13% GC)
  maximum time:     506.565 μs (82.24% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark reshape(a, 1<<7, 1<<7) * reshape(a, 1<<7, 1<<7)
BenchmarkTools.Trial: 
  memory estimate:  128.27 KiB
  allocs estimate:  6
  --------------
  minimum time:     136.216 μs (0.00% GC)
  median time:      210.382 μs (0.00% GC)
  mean time:        240.261 μs (1.67% GC)
  maximum time:     25.140 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

I thought it was fixed: https://github.com/JuliaLang/julia/pull/40468
But forgot I have to wait for Julia 1.7.

permutedims is also inherently slow, that’s why I developed Strided.jl and use this in TensorOperations.jl:

julia> using BenchmarkTools, Strided, Random

julia> a = randn(fill(2, 20)...);

julia> b = similar(a);

julia> p = (randperm(20)...,)
(17, 20, 14, 11, 5, 4, 18, 16, 3, 19, 1, 13, 9, 8, 12, 7, 2, 6, 15, 10)

julia> @benchmark permutedims!($b, $a, $p)
BenchmarkTools.Trial: 
  memory estimate:  784.00 MiB
  allocs estimate:  25165886
  --------------
  minimum time:     846.591 ms (2.18% GC)
  median time:      850.013 ms (2.16% GC)
  mean time:        853.072 ms (2.15% GC)
  maximum time:     866.550 ms (1.99% GC)
  --------------
  samples:          6
  evals/sample:     1

julia> @benchmark @strided permutedims!($b, $a, $p)
BenchmarkTools.Trial: 
  memory estimate:  11.00 KiB
  allocs estimate:  105
  --------------
  minimum time:     2.449 ms (0.00% GC)
  median time:      2.643 ms (0.00% GC)
  mean time:        2.669 ms (0.00% GC)
  maximum time:     4.037 ms (0.00% GC)
  --------------
  samples:          1864
  evals/sample:     1
1 Like

Ok, but I see now, beyond 16 there is a huge gap due to the allocations. Below that the gap is much smaller than this.