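I find that permutedims is very slow, in a way that complexity arguments alone cannot explain: it only has to move 2^20 elements, yet in the benchmarks below it takes far longer than a 1024×1024 matrix multiplication on the same data (minimum time about 1.4 s versus about 39 ms).
Is this because permutedims is not fully optimized? It accounts for more than 90% of the computing time of my tensor program.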
julia> a = randn(fill(2, 20)...);
julia> using BenchmarkTools
julia> @benchmark reshape(a, 1<<10, 1<<10) * reshape(a, 1<<10, 1<<10)
BenchmarkTools.Trial:
memory estimate: 8.00 MiB
allocs estimate: 6
--------------
minimum time: 38.835 ms (0.00% GC)
median time: 83.084 ms (0.00% GC)
mean time: 89.822 ms (0.66% GC)
maximum time: 160.574 ms (0.00% GC)
--------------
samples: 56
evals/sample: 1
julia> using Random
julia> @benchmark permutedims(a, randperm(20))
BenchmarkTools.Trial:
memory estimate: 792.00 MiB
allocs estimate: 25165893
--------------
minimum time: 1.380 s (1.78% GC)
median time: 1.538 s (1.74% GC)
mean time: 1.563 s (4.94% GC)
maximum time: 1.797 s (12.87% GC)
--------------
samples: 4
evals/sample: 1
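
To narrow down where the time goes, it may help to time the permutation separately from `randperm` and from global-variable access. The sketch below is only a suggested way to measure, not a fix: it draws one permutation up front, interpolates the arguments with `$`, and compares against a plain `copy` of the same array and against `permutedims!` writing into a preallocated buffer. The names `perm` and `out` are mine and do not appear in the benchmark above.

```julia
using BenchmarkTools, Random

a = randn(fill(2, 20)...)    # 2^20 Float64 values (8 MiB), as in the post
perm = randperm(20)          # draw one permutation outside the timed region
out = similar(a)             # every dimension is 2, so the permuted shape equals size(a)

# Baseline: a plain copy moves the same amount of memory as a permutation.
@btime copy($a);

# Allocating version, with the arguments interpolated so globals are not timed.
@btime permutedims($a, $perm);

# In-place version writing into the preallocated destination.
@btime permutedims!($out, $a, $perm);
```

If the `copy` baseline is orders of magnitude faster than both `permutedims` calls, the overhead would be in the permutation itself rather than in memory traffic or benchmark setup.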