I have been experimenting with matrix multiplication with transpose or PermutedDimsArray functions. My goal was to find the cost for multiplying a transposed array, and I encountered a behavior of transpose I cannot explain.
My initial hypothesis was this: In Julia, a matrix A is stored in column-major order. Therefore, the operation Ax is faster than A^Tx, since the latter needs to multiply a non-contiguous set of indices. (c.f.https://docs.julialang.org/en/v1/manual/performance-tips/index.html#Access-arrays-in-memory-order,-along-columns-1)
To test this hypothesis, I implemented four versions of matrix multiply: standard multiplication, PermutedDimsArray, PermutedDimsArray+copy, and transpose. The code is as follows.
function normal_multiply(A,x) #normal multiplication used as reference At = PermutedDimsArray(A,(2,1)) return A*x end function PermutedDimsArray_multiply(A,x) #permuted multiplication. It creates a view At = PermutedDimsArray(A,(2,1)) return At*x end function copy_multiply(A,x) #copies the matrix after permutation At = copy(PermutedDimsArray(A,(2,1))) #creates a view and copies it return At*x end function transpose_multiply(A,x) #for linear algebra purposes, transpose is a good way to create a new array At = transpose(A) #creates a new transposed array return At*x end N = 1000 A = rand(N,N) x = rand(N) println("normal_multiply") @btime normal_multiply($A,$x); println("PermutedDimsArray_multiply") @btime PermutedDimsArray_multiply($A,$x); println("copy_multiply") @btime copy_multiply($A,$x); println("transpose_multiply") @btime transpose_multiply($A,$x);
When I benchmarked the above code, I got the following result.
944.228 μs (5 allocations: 8.11 KiB)
7.250 ms (5 allocations: 8.11 KiB)
11.344 ms (7 allocations: 7.64 MiB)
790.721 μs (1 allocation: 7.94 KiB)
Several things to note here.
- permuted version is much slower than the normal_multiply. This is unsurprising due to the non-contiguous data storage.
- permuted+copy is slower than permuted on its own. It seems the speed up from making data contiguous in memory is not large enough compared to the cost of copy. (c.f. https://docs.julialang.org/en/v1/manual/performance-tips/index.html#Copying-data-is-not-always-bad-1)
- transpose is as fast as normal_multiply
I am very surprised by the last result. I thought it would cost as much as PermutedDimsArray version did, but somehow transpose(A)x is as fast as Ax. I don’t understand why this is possible, and would like to know some explanation for this. My guess is it has something to do with the fact that transpose(A) can act as a lazy wrapper, but I do not know how that works in practice.
Thanks in advance!