Question About Transposes and Matrix Multiplication Speed

I’ve noticed that on Julia 0.7.0, A*c is faster than B'*c in the following case, when I would have expected that the transposed layout would align the memory better for the computer to perform the matrix multiplication.

using BenchmarkTools

A = rand(1_000_000, 10);
B = copy(transpose(A));
c = rand(10);

@btime $A*$c;
@btime $B'*$c;

Does anyone know why this is? Thank you!

B is 10 × 1_000_000, so you cannot multiply it by a vector of length 10. Also, don’t copy the transpose; you probably want the lazy transpose in this case. And use $ with btime, e.g. @btime $A * $c.


@mohamed82008, thanks for your response. I made a couple of typos in transcribing to Discourse. I meant both to include the $ and to transpose B in the matrix multiplication (i.e. write @btime $B'*$c, which I have now fixed in the original post).

The issue I’m pointing to is that I would have expected B'*c to be faster than A*c because the matrix multiplication can iterate over columns of B as opposed to rows of A.

I would expect that BLAS is clever enough to know to iterate over A in an optimal fashion.


@DNF, perhaps you’re right and my intuition about the “right” dimension on which to iterate was wrong. Now that I think about it, the best algorithm may not be to perform the matrix multiplication row-by-row on A (i.e. column-by-column on B). It may instead be to iterate over the columns of A and the elements of c. If that’s the case, then my intuition was wrong and it should be faster to store A rather than A’.

Yes, this is my intuition too. I mentally envision a matrix-vector product as splitting the matrix into separate columns, and then using the vector elements to scale each column, like this:

No idea what BLAS actually does, though.
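That column-scaling picture corresponds to an axpy-style loop, which walks the matrix in memory order since Julia stores matrices column-major. A naive sketch in plain Julia (the helper name `colwise_matvec` is mine, and this is not what BLAS actually runs):

```julia
# Column-oriented matrix-vector product: y = A*c computed as
# y = c[1]*A[:,1] + c[2]*A[:,2] + ..., i.e. each column of A is
# scaled by one element of c and accumulated into y.
function colwise_matvec(A::Matrix{Float64}, c::Vector{Float64})
    m, n = size(A)
    @assert n == length(c)
    y = zeros(m)
    for j in 1:n
        cj = c[j]
        for i in 1:m          # inner loop runs down a column: contiguous memory
            y[i] += cj * A[i, j]
        end
    end
    return y
end
```

The inner loop touches contiguous memory, which is why storing A (rather than its transpose) suits this algorithm.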


Yes, that is basically correct. The column chunks should really be a multiple of the cache-line length (64 bytes, or 8 Float64s), though, as long as the matrix A has at least 8 rows, or at least a multiple of the SIMD vector length (i.e., 4 on most computers).

Common kernels calculate 8×6 blocks of the output from 8×N and N×6 inputs at a time.
Pre-AVX-512 computers have 16 vector registers, so you can hold the 8×6 answer in 12 registers and dedicate 2 more to the current column of the first matrix, using 14 registers in total. Then iterate over the 6 columns of the right-side matrix, broadcasting scalars and updating the six columns of the product with fma instructions.
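As a rough illustration only (a real kernel keeps the 8×6 accumulator and the current column in SIMD registers; this toy `microkernel_8x6!` is my own sketch, not actual BLAS code), the broadcast-and-fma pattern looks like:

```julia
# Toy 8x6 micro-kernel: C[1:8, 1:6] += A[1:8, 1:N] * B[1:N, 1:6].
# Plain arrays stand in for registers here.
function microkernel_8x6!(C, A, B, N)
    for k in 1:N
        a = view(A, 1:8, k)      # one 8-element column of A (2 SIMD registers in a real kernel)
        for j in 1:6
            b = B[k, j]          # broadcast one scalar of B
            for i in 1:8
                C[i, j] = muladd(a[i], b, C[i, j])  # fma update of the accumulator block
            end
        end
    end
    return C
end
```

A full gemm tiles the whole problem into such blocks so the accumulator never leaves registers while k runs.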

AVX-512 is much faster for matrix multiplication because not only are the registers twice the size, there are also twice as many (32), meaning you get far more operations in before having to do a lot of swapping of register contents.


@Elrod and @DNF, thanks! Mystery solved: it’s incorrect to rely on the assumption that BLAS is going to follow the algorithm you learned in grade school.

I’m using functions like At_mul_B() for cases like this.

I just hope that these “shortcuts” (most of them are actually inline functions) have chosen the fastest way already.

This was removed in 1.0 in favor of the generic mul!(C, A, B).

julia> using LinearAlgebra

julia> A = [1.0 2.0; 3.0 4.0]; B = [1.0 1.0; 1.0 1.0]; Y = similar(B); mul!(Y, A, B);

julia> Y
2×2 Array{Float64,2}:
 3.0  3.0
 7.0  7.0
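For the old At_mul_B(A, B) specifically, the 1.0 idiom is to pass a lazy transpose (or adjoint) into mul!, which dispatches to the transposed BLAS call without copying:

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]; B = [1.0 1.0; 1.0 1.0]; Y = similar(B)
mul!(Y, transpose(A), B)   # replaces At_mul_B(A, B); A' also works for real matrices
# Y == [4.0 4.0; 6.0 6.0]
```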