I have been experimenting with matrix multiplication with transpose or PermutedDimsArray functions. My goal was to find the cost for multiplying a transposed array, and I encountered a behavior of transpose I cannot explain.

My initial hypothesis was this: In Julia, a matrix A is stored in column-major order. Therefore, the operation Ax is faster than A^Tx, since the latter needs to multiply a non-contiguous set of indices. (c.f.https://docs.julialang.org/en/v1/manual/performance-tips/index.html#Access-arrays-in-memory-order,-along-columns-1)

To test this hypothesis, I implemented four versions of matrix multiply: standard multiplication, PermutedDimsArray, PermutedDimsArray+copy, and transpose. The code is as follows.

```
function normal_multiply(A,x)
#normal multiplication used as reference
At = PermutedDimsArray(A,(2,1))
return A*x
end
function PermutedDimsArray_multiply(A,x)
#permuted multiplication. It creates a view
At = PermutedDimsArray(A,(2,1))
return At*x
end
function copy_multiply(A,x)
#copies the matrix after permutation
At = copy(PermutedDimsArray(A,(2,1))) #creates a view and copies it
return At*x
end
function transpose_multiply(A,x)
#for linear algebra purposes, transpose is a good way to create a new array
At = transpose(A) #creates a new transposed array
return At*x
end
N = 1000
A = rand(N,N)
x = rand(N)
println("normal_multiply")
@btime normal_multiply($A,$x);
println("PermutedDimsArray_multiply")
@btime PermutedDimsArray_multiply($A,$x);
println("copy_multiply")
@btime copy_multiply($A,$x);
println("transpose_multiply")
@btime transpose_multiply($A,$x);
```

When I benchmarked the above code, I got the following result.

normal_multiply

944.228 ÎĽs (5 allocations: 8.11 KiB)

PermutedDimsArray_multiply

7.250 ms (5 allocations: 8.11 KiB)

copy_multiply

11.344 ms (7 allocations: 7.64 MiB)

transpose_multiply

790.721 ÎĽs (1 allocation: 7.94 KiB)

Several things to note here.

- permuted version is much slower than the normal_multiply. This is unsurprising due to the non-contiguous data storage.
- permuted+copy is slower than permuted on its own. It seems the speed up from making data contiguous in memory is not large enough compared to the cost of copy. (c.f. https://docs.julialang.org/en/v1/manual/performance-tips/index.html#Copying-data-is-not-always-bad-1)
- transpose is as fast as normal_multiply

I am very surprised by the last result. I thought it would cost as much as PermutedDimsArray version did, but somehow transpose(A)*x is as fast as A*x. I donâ€™t understand why this is possible, and would like to know some explanation for this. My guess is it has something to do with the fact that transpose(A) can act as a lazy wrapper, but I do not know how that works in practice.

Thanks in advance!