How to achieve the greatest speed with nested loops?

It helps if the vectorized dimension – X’s first – is a relatively round multiple of your computer’s SIMD vector width. You should see good performance with 24, for example.

julia> Y1 = zeros(24,24,24); Y2 = similar(Y1);

julia> X1 = randn(100,24,24);

julia> X2 = PermutedDimsArray(permutedims(X1,(2,1,3)),(2,1,3));

julia> @btime eight!($Y1, $X1);
  47.458 μs (0 allocations: 0 bytes)

julia> @btime eight!($Y2, $X2);
  22.203 μs (0 allocations: 0 bytes)

julia> 2e-9 * length(Y) * size(X1,1) ./ (47.458e-6, 22.203e-6)
(58.25782797420877, 124.52371301175516)

Just making the dimension bigger would help, in that the remainder would be a smaller % of the overall size, so 50 may be better than 20.

There are a lot of things at play. If we just make the first axis of X a lot bigger, the permuted version slows down by a lot:

julia> X1 = randn(1000,24,24);

julia> X2 = PermutedDimsArray(permutedims(X1,(2,1,3)),(2,1,3));

julia> @btime eight!($Y1, $X1);
  457.149 μs (0 allocations: 0 bytes)

julia> @btime eight!($Y2, $X2);
  327.771 μs (0 allocations: 0 bytes)

julia> 2e-9 * length(Y) * size(X1,1) ./ (457.149e-6, 327.771e-6)
(60.47918731092051, 84.35157472747741)

This is because it moves across that axis very quickly, and starts starving for memory. That axis is also no longer contiguous, making it worse to move across quickly that it is for the non-permuted form.

It’s not that bad yet on my computer with the 24x24 or even 48x48 sizes. But 50 adds on two more problems:

julia> Y1 = zeros(50,50,50); Y2 = similar(Y1);

julia> X1 = randn(1000,50,50);

julia> X2 = PermutedDimsArray(permutedims(X1,(2,1,3)),(2,1,3));

julia> @btime eight!($Y1, $X1);
  3.299 ms (0 allocations: 0 bytes)

julia> @btime eight!($Y2, $X2);
  3.424 ms (0 allocations: 0 bytes)

julia> 2e-9 * length(Y) * size(X1,1) ./ (3.299e-3, 3.424e-3)
(75.78053955744167, 73.0140186915888)

Those are memory alignment, that those axes it has to sum across are now further apart (separated by 50 instead of 24 elements), and that 50 % vector width is bigger than it is for 24 or 48 on my computer.

For the permuted example, memory alignment isn’t normally that big a deal. But it is for the unpermuted case, where you need to quickly load vectors along both contiguous axes (this is slower on most computers when memory isn’t aligned, which happens for multidimensional arrays when that first axis isn’t a multiple of 64 bytes).
If we make the summed over axis 999 instead of 1000 elements, look how much slower it gets:

julia> X1 = randn(999,50,50);

julia> X2 = PermutedDimsArray(permutedims(X1,(2,1,3)),(2,1,3));

julia> @btime eight!($Y1, $X1);
  4.763 ms (0 allocations: 0 bytes)

julia> @btime eight!($Y2, $X2);
  3.462 ms (0 allocations: 0 bytes)

julia> 2e-9 * length(Y) * size(X1,1) ./ (4.763e-3, 3.462e-3)
(52.43543984883477, 72.14038128249567)

Floating point math is not associative, so I check approximate instead of exact equality to allow for rounding differences.