Thank you so much for pointing that out.
The motivation is that, from what I've heard and from some tests of my own, TensorOperations.jl is often the fastest way to do `permutedims` on larger arrays: it uses a smarter, cache-friendly blocking algorithm than the one in Base. How much this matters of course depends on the size and the permutation.
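For concreteness, here is a minimal sketch of what I mean, using TensorOperations.jl's standard `@tensor` macro (the array size and the permutation are made up for illustration):

```julia
using TensorOperations

A = randn(100, 100, 100)

# Base's permutedims:
B1 = permutedims(A, (3, 1, 2))

# The cache-friendly equivalent via @tensor; the index order on the
# left-hand side encodes the same (3, 1, 2) permutation:
@tensor B2[c, a, b] := A[a, b, c]

B1 ≈ B2  # true
```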
My goal involves a series of array additions and permutations, and hand-tuning the memory-cache behavior of that in NumPy/Fortran to improve performance is complicated. TensorOperations.jl has built-in support for exactly this, as in the sketch below.
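As a hedged sketch of the kind of fused add-plus-permute step I have in mind (the names `A`, `B`, `C` and the sizes are illustrative, not from my actual code):

```julia
using TensorOperations

A = randn(50, 50, 50)
B = randn(50, 50, 50)

# One fused pass: permute B and add it to A, without first
# materializing permutedims(B, ...) as a temporary array:
@tensor C[a, b, c] := A[a, b, c] + 2.0 * B[c, a, b]
```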
In one example in the above link, the timing from Python is about 4.956 ms, and in Julia it is 2.585 ms (21 allocations: 6.18 MiB), about a factor of 2 better. I observed a similar pattern in other cases.
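For reference, output of the form "2.585 ms (21 allocations: 6.18 MiB)" is what BenchmarkTools prints; a sketch of how that kind of measurement can be reproduced (the array size here is a guess, not the one from the linked example):

```julia
using TensorOperations, BenchmarkTools

A = randn(90, 90, 90)

# Wrap the permutation in a function and benchmark it;
# $A interpolates the array so setup cost isn't measured:
perm_to(A) = @tensor B[c, a, b] := A[a, b, c]
@btime perm_to($A);
```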