I’ve noticed that for Julia 0.7.0 that A*c is faster than A'*c in the following case, when I would have expected that A' would make the align the memory more easily for the computer to perform the matrix multiplication.
@mohamed82008, thanks for your response. I made a couple typos in transcribing to Discourse. I meant to both include the $ and transpose B in the matrix multiplication (i.e. write @btime $B'*$c, which I have now fixed in the original post).
The issue I’m pointing to is that I would have expected B'*c to be faster than A*c because the matrix multiplication can iterate over columns of B as opposed to rows of A.
@DNF, perhaps you’re right and my intuition about the “right” dimension on which to iterate was wrong. Now that I think about it, the best algorithm may not be to perform the matrix multiplication row-by-row on A (i.e. column-by-column on B). It may instead be to iterate over the columns of A and the elements of c. If that’s the case, then my intuition was wrong and it should be faster to store A rather than A’.
Yes, that is basically correct. The columns should really be a multiple of cacheline length though (64 bytes, or 8 Float64s) though (as long as the matrix A has at least 8 rows).
Or at least a multiple of SIMD vector length (ie, 4 on most computers).
Common kernels calculate 8x6 blocks from 8xN * Nx6 inputs at a time.
Pre-avx512 computers have 16 registers, so you can hold the 8x6 answer in a total of 12 registers, 2 registers dedicated to a column of the first matrix at a time using 14 registers. Finally, iterate over the 6 columns of the right side matrix, broadcasting scalars and updating the six columns of the product with fma instructions.
avx512 is much faster for matrix multiplication because not only are your registers twice the size, but there are twice as many (32), meaning you get way more operations in before having to do a lot of swapping of register contents.