# Multiplication after transpose much faster than multiplication after PermutedDimsArray

I have been experimenting with matrix multiplication with transpose or PermutedDimsArray functions. My goal was to find the cost for multiplying a transposed array, and I encountered a behavior of transpose I cannot explain.
My initial hypothesis was this: In Julia, a matrix A is stored in column-major order. Therefore, the operation Ax is faster than A^Tx, since the latter needs to multiply a non-contiguous set of indices. (c.f.https://docs.julialang.org/en/v1/manual/performance-tips/index.html#Access-arrays-in-memory-order,-along-columns-1)
To test this hypothesis, I implemented four versions of matrix multiply: standard multiplication, PermutedDimsArray, PermutedDimsArray+copy, and transpose. The code is as follows.

``````
function normal_multiply(A,x)
#normal multiplication used as reference
At = PermutedDimsArray(A,(2,1))
return A*x
end

function PermutedDimsArray_multiply(A,x)
#permuted multiplication. It creates a view
At = PermutedDimsArray(A,(2,1))
return At*x
end

function copy_multiply(A,x)
#copies the matrix after permutation
At = copy(PermutedDimsArray(A,(2,1))) #creates a view and copies it
return At*x
end

function transpose_multiply(A,x)
#for linear algebra purposes, transpose is a good way to create a new array
At = transpose(A) #creates a new transposed array
return At*x
end

N = 1000
A = rand(N,N)
x = rand(N)

println("normal_multiply")
@btime normal_multiply(\$A,\$x);
println("PermutedDimsArray_multiply")
@btime PermutedDimsArray_multiply(\$A,\$x);
println("copy_multiply")
@btime copy_multiply(\$A,\$x);
println("transpose_multiply")
@btime transpose_multiply(\$A,\$x);

``````

When I benchmarked the above code, I got the following result.
normal_multiply
944.228 ÎĽs (5 allocations: 8.11 KiB)
PermutedDimsArray_multiply
7.250 ms (5 allocations: 8.11 KiB)
copy_multiply
11.344 ms (7 allocations: 7.64 MiB)
transpose_multiply
790.721 ÎĽs (1 allocation: 7.94 KiB)

Several things to note here.

1. permuted version is much slower than the normal_multiply. This is unsurprising due to the non-contiguous data storage.
2. permuted+copy is slower than permuted on its own. It seems the speed up from making data contiguous in memory is not large enough compared to the cost of copy. (c.f. https://docs.julialang.org/en/v1/manual/performance-tips/index.html#Copying-data-is-not-always-bad-1)
3. transpose is as fast as normal_multiply

I am very surprised by the last result. I thought it would cost as much as PermutedDimsArray version did, but somehow transpose(A)x is as fast as Ax. I donâ€™t understand why this is possible, and would like to know some explanation for this. My guess is it has something to do with the fact that transpose(A) can act as a lazy wrapper, but I do not know how that works in practice.

Iâ€™m afraid the â€śwhyâ€ť here isnâ€™t going to be very satisfying. It boils down to the fact that LinearAlgebra has specialized multiplication methods for `::Transpose`s but not for `::PermutedDimsArray`s. The latter could indeed use the same specialized methods in this particular case, but weâ€™re simply missing those specializations.