Question About Transposes and Matrix Multiplication Speed

I’ve noticed that on Julia 0.7.0, A*c is faster than B'*c in the following case, when I would have expected that the transposed layout would align the memory better for the computer to perform the matrix multiplication.

using BenchmarkTools

A = rand(1_000_000, 10);
B = copy(transpose(A));
c = rand(10);

@btime $A*$c;
@btime $B'*$c;

Does anyone know why this is? Thank you!

B is 10 × 1_000_000, so you cannot multiply it by a vector of length 10. Also, don’t copy the transpose; you probably want the lazy transpose in this case. And use $ with btime, e.g. @btime $A * $c.


@mohamed82008, thanks for your response. I made a couple of typos in transcribing to Discourse. I meant both to include the $ and to transpose B in the matrix multiplication (i.e. write @btime $B'*$c, which I have now fixed in the original post).

The issue I’m pointing to is that I would have expected B'*c to be faster than A*c because the matrix multiplication can iterate over columns of B as opposed to rows of A.

I would expect that BLAS is clever enough to know to iterate over A in an optimal fashion.


@DNF, perhaps you’re right and my intuition about the “right” dimension on which to iterate was wrong. Now that I think about it, the best algorithm may not be to perform the matrix multiplication row-by-row on A (i.e. column-by-column on B). It may instead be to iterate over the columns of A and the elements of c. If that’s the case, then my intuition was wrong and it should be faster to store A rather than A’.

Yes, this is my intuition too. I mentally envision a matrix-vector product as splitting the matrix into separate columns, and then using the vector elements to scale each column, like this:

No idea what BLAS actually does, though.
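That column-scaling picture corresponds to an axpy-style loop, which walks the matrix in memory order since Julia stores matrices column-major. A naive sketch in plain Julia (the helper name `colwise_matvec` is mine, and this is not what BLAS actually runs):

```julia
# Column-oriented matrix-vector product: y = A*c computed as
# y = c[1]*A[:,1] + c[2]*A[:,2] + ..., i.e. each column of A is
# scaled by one element of c and accumulated into y.
function colwise_matvec(A::Matrix{Float64}, c::Vector{Float64})
    m, n = size(A)
    @assert n == length(c)
    y = zeros(m)
    for j in 1:n
        cj = c[j]
        for i in 1:m          # inner loop runs down a column: contiguous memory
            y[i] += cj * A[i, j]
        end
    end
    return y
end
```

The inner loop touches contiguous memory, which is why storing A (rather than its transpose) suits this algorithm.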


Yes, that is basically correct. The column chunks should really be a multiple of the cache-line length (64 bytes, or 8 Float64s), though, as long as the matrix A has at least 8 rows, or at least a multiple of the SIMD vector length (i.e., 4 on most computers).

Common kernels calculate 8×6 blocks of the output from 8×N and N×6 inputs at a time.
Pre-AVX-512 computers have 16 vector registers, so you can hold the 8×6 answer in 12 registers and dedicate 2 more to the current column of the first matrix, using 14 registers in total. Then iterate over the 6 columns of the right-side matrix, broadcasting scalars and updating the six columns of the product with fma instructions.
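As a rough illustration only (a real kernel keeps the 8×6 accumulator and the current column in SIMD registers; this toy `microkernel_8x6!` is my own sketch, not actual BLAS code), the broadcast-and-fma pattern looks like:

```julia
# Toy 8x6 micro-kernel: C[1:8, 1:6] += A[1:8, 1:N] * B[1:N, 1:6].
# Plain arrays stand in for registers here.
function microkernel_8x6!(C, A, B, N)
    for k in 1:N
        a = view(A, 1:8, k)      # one 8-element column of A (2 SIMD registers in a real kernel)
        for j in 1:6
            b = B[k, j]          # broadcast one scalar of B
            for i in 1:8
                C[i, j] = muladd(a[i], b, C[i, j])  # fma update of the accumulator block
            end
        end
    end
    return C
end
```

A full gemm tiles the whole problem into such blocks so the accumulator never leaves registers while k runs.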

AVX-512 is much faster for matrix multiplication because not only are the registers twice the size, there are also twice as many (32), meaning you get far more operations in before having to do a lot of swapping of register contents.


@Elrod and @DNF, thanks! Mystery solved: it’s incorrect to rely on the assumption that BLAS is going to follow the algorithm you learned in grade school.

I’m using functions like At_mul_B() for cases like this.

I just hope that these “shortcuts” (most of them are actually inline functions) have chosen the fastest way already.

This was removed in 1.0 in favor of the generic mul!(C, A, B).

julia> using LinearAlgebra

julia> A = [1.0 2.0; 3.0 4.0]; B = [1.0 1.0; 1.0 1.0]; Y = similar(B); mul!(Y, A, B);

julia> Y
2×2 Array{Float64,2}:
 3.0  3.0
 7.0  7.0
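For the old At_mul_B(A, B) specifically, the 1.0 idiom is to pass a lazy transpose (or adjoint) into mul!, which dispatches to the transposed BLAS call without copying:

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]; B = [1.0 1.0; 1.0 1.0]; Y = similar(B)
mul!(Y, transpose(A), B)   # replaces At_mul_B(A, B); A' also works for real matrices
# Y == [4.0 4.0; 6.0 6.0]
```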