Efficiency of calling Julia from Python vs. running Julia directly

Thank you so much for pointing that out.

The motivation is that, from what I have heard and from some tests I ran:

TensorOperations.jl is often the fastest way to do `permutedims` on larger arrays. It has a smarter, cache-friendly blocking algorithm than the one in Base. How much this matters depends, of course, on the array size and the permutation.

My goal involves a series of array additions and permutations, and it is complicated to optimize cache behavior by hand in NumPy/Fortran to improve performance. TensorOperations.jl has built-in support for this.
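For concreteness, here is a minimal NumPy sketch of the kind of workload I mean; the shape and the permutation are made up for illustration. `np.transpose` only returns a view, so the cache-unfriendly strided traversal happens when the result is materialized or combined with another array:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((64, 64, 64))

# A typical "addition of a permuted array" step:
# B[i, j, k] = A[i, j, k] + A[j, k, i].
# The transpose itself is lazy; the strided reads (and hence the
# cache behavior) are paid here, when the sum is computed.
B = A + np.transpose(A, (2, 0, 1))
```

In NumPy there is no direct way to control the blocking of that strided traversal, which is exactly what TensorOperations.jl handles internally.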

In one example in the above link, the timing from Python is about 4.956 ms, while pure Julia takes 2.585 ms (21 allocations: 6.18 MiB), roughly a factor of 2 faster. I observed a similar pattern in other cases.