Multiplication after transpose much faster than multiplication after PermutedDimsArray

tomohiro_soejima · April 10, 2019, 1:12pm

I have been experimenting with matrix multiplication with transpose or PermutedDimsArray functions. My goal was to find the cost for multiplying a transposed array, and I encountered a behavior of transpose I cannot explain.
My initial hypothesis was this: In Julia, a matrix A is stored in column-major order. Therefore, the operation Ax is faster than A^Tx, since the latter needs to multiply a non-contiguous set of indices. (c.f.Performance Tips · The Julia Language)
To test this hypothesis, I implemented four versions of matrix multiply: standard multiplication, PermutedDimsArray, PermutedDimsArray+copy, and transpose. The code is as follows.


function normal_multiply(A,x)
    #normal multiplication used as reference
    At = PermutedDimsArray(A,(2,1)) 
    return A*x
end
    
function PermutedDimsArray_multiply(A,x)
    #permuted multiplication. It creates a view
    At = PermutedDimsArray(A,(2,1)) 
    return At*x
end

function copy_multiply(A,x)
    #copies the matrix after permutation
    At = copy(PermutedDimsArray(A,(2,1))) #creates a view and copies it
    return At*x
end

function transpose_multiply(A,x)
    #for linear algebra purposes, transpose is a good way to create a new array
    At = transpose(A) #creates a new transposed array
    return At*x
end

N = 1000
A = rand(N,N)
x = rand(N)

println("normal_multiply")
@btime normal_multiply($A,$x);
println("PermutedDimsArray_multiply")
@btime PermutedDimsArray_multiply($A,$x);
println("copy_multiply")
@btime copy_multiply($A,$x);
println("transpose_multiply")
@btime transpose_multiply($A,$x);

When I benchmarked the above code, I got the following result.
normal_multiply
944.228 μs (5 allocations: 8.11 KiB)
PermutedDimsArray_multiply
7.250 ms (5 allocations: 8.11 KiB)
copy_multiply
11.344 ms (7 allocations: 7.64 MiB)
transpose_multiply
790.721 μs (1 allocation: 7.94 KiB)

Several things to note here.

permuted version is much slower than the normal_multiply. This is unsurprising due to the non-contiguous data storage.
permuted+copy is slower than permuted on its own. It seems the speed up from making data contiguous in memory is not large enough compared to the cost of copy. (c.f. Performance Tips · The Julia Language)
transpose is as fast as normal_multiply

I am very surprised by the last result. I thought it would cost as much as PermutedDimsArray version did, but somehow transpose(A)x is as fast as Ax. I don’t understand why this is possible, and would like to know some explanation for this. My guess is it has something to do with the fact that transpose(A) can act as a lazy wrapper, but I do not know how that works in practice.

Thanks in advance!

kristoffer.carlsson · April 10, 2019, 2:31pm

These will just en up calling OpenBLAS which has optimized versions for both transpose and non-transpose matrix times vector.

mbauman · April 10, 2019, 3:29pm

I’m afraid the “why” here isn’t going to be very satisfying. It boils down to the fact that LinearAlgebra has specialized multiplication methods for ::Transposes but not for ::PermutedDimsArrays. The latter could indeed use the same specialized methods in this particular case, but we’re simply missing those specializations.

In short: we don’t (yet) have a good way to dispatch on array storage type — and BLAS requires a very particular sort of array storage (that is, strided) to use its highly optimized methods.

tomohiro_soejima · April 10, 2019, 7:45pm

Hi,

Thanks for the replies. I understand it has to do with OpenBLAS implementation.
As a related side question, is there a plan to develop a way to dispatch on array storage types in the future?

Topic		Replies	Views
Question About Transposes and Matrix Multiplication Speed Numerics	9	2077	August 22, 2018
Comparing matrix multiplication between Julia and C Performance question , matrices	2	882	July 7, 2020
PermutedDimsArray slower than permutedims? New to Julia performance , blas	3	1567	September 10, 2020
Non-intuitive perf diff between `matrix * vector`, `matrix' * vector` and `copy(matrix') * vector` Performance blas	2	680	September 27, 2019
Effect of Column Major Storage on Multiplication Performance	3	434	November 7, 2020

Multiplication after transpose much faster than multiplication after PermutedDimsArray

Related topics