To put numbers on that:
julia> @btime mult1!($c, $A, $x, $L); # as above, my computer
13.159 ms (800 allocations: 34.38 KiB)
julia> using NNlib # to do a different computation!
julia> @btime batched_mul!(reshape($c,$L,1,$L), $A, reshape($x,$L,1,$L));
1.985 ms (39 allocations: 3.23 KiB)
julia> @btime permutedims!($(similar(A)), $A, (2,3,1)); # cost of re-ordering
13.308 ms (0 allocations: 0 bytes)
Here batched_mul is computing c[a,i] = A[a,b,i] * x[b,i] in which each matrix slice has stride 1. If you cannot re-organise your calculation to have this throughout, then the cost of permuting is quite high, and in this case another approach is:
julia> using Tullio
julia> mult3!(c, A, x, L=0) = @tullio c[i,a] = A[i,a,b] * x[i,b] avx=false; # without LV
julia> @btime mult3!($c, $A, $x, $L);
2.328 ms (49 allocations: 2.53 KiB)
julia> using LoopVectorization
julia> mult4!(c, A, x, L=0) = @tullio c[i,a] = A[i,a,b] * x[i,b]; # avx=true default LV loaded
julia> @btime mult4!($c, $A, $x, $L);
1.565 ms (49 allocations: 2.53 KiB)