Multiplication after transpose much faster than multiplication after PermutedDimsArray

I have been experimenting with matrix multiplication with transpose or PermutedDimsArray functions. My goal was to find the cost for multiplying a transposed array, and I encountered a behavior of transpose I cannot explain.
My initial hypothesis was this: In Julia, a matrix A is stored in column-major order. Therefore, the operation Ax is faster than A^Tx, since the latter needs to multiply a non-contiguous set of indices. (c.f.,-along-columns-1)
To test this hypothesis, I implemented four versions of matrix multiply: standard multiplication, PermutedDimsArray, PermutedDimsArray+copy, and transpose. The code is as follows.

function normal_multiply(A,x)
    #normal multiplication used as reference
    At = PermutedDimsArray(A,(2,1)) 
    return A*x
function PermutedDimsArray_multiply(A,x)
    #permuted multiplication. It creates a view
    At = PermutedDimsArray(A,(2,1)) 
    return At*x

function copy_multiply(A,x)
    #copies the matrix after permutation
    At = copy(PermutedDimsArray(A,(2,1))) #creates a view and copies it
    return At*x

function transpose_multiply(A,x)
    #for linear algebra purposes, transpose is a good way to create a new array
    At = transpose(A) #creates a new transposed array
    return At*x

N = 1000
A = rand(N,N)
x = rand(N)

@btime normal_multiply($A,$x);
@btime PermutedDimsArray_multiply($A,$x);
@btime copy_multiply($A,$x);
@btime transpose_multiply($A,$x);

When I benchmarked the above code, I got the following result.
944.228 μs (5 allocations: 8.11 KiB)
7.250 ms (5 allocations: 8.11 KiB)
11.344 ms (7 allocations: 7.64 MiB)
790.721 μs (1 allocation: 7.94 KiB)

Several things to note here.

  1. permuted version is much slower than the normal_multiply. This is unsurprising due to the non-contiguous data storage.
  2. permuted+copy is slower than permuted on its own. It seems the speed up from making data contiguous in memory is not large enough compared to the cost of copy. (c.f.
  3. transpose is as fast as normal_multiply

I am very surprised by the last result. I thought it would cost as much as PermutedDimsArray version did, but somehow transpose(A)x is as fast as Ax. I don’t understand why this is possible, and would like to know some explanation for this. My guess is it has something to do with the fact that transpose(A) can act as a lazy wrapper, but I do not know how that works in practice.

Thanks in advance!

These will just en up calling OpenBLAS which has optimized versions for both transpose and non-transpose matrix times vector.

I’m afraid the “why” here isn’t going to be very satisfying. It boils down to the fact that LinearAlgebra has specialized multiplication methods for ::Transposes but not for ::PermutedDimsArrays. The latter could indeed use the same specialized methods in this particular case, but we’re simply missing those specializations.

In short: we don’t (yet) have a good way to dispatch on array storage type — and BLAS requires a very particular sort of array storage (that is, strided) to use its highly optimized methods.



Thanks for the replies. I understand it has to do with OpenBLAS implementation.
As a related side question, is there a plan to develop a way to dispatch on array storage types in the future?