Non-intuitive perf diff between `matrix * vector`, `matrix' * vector` and `copy(matrix') * vector`

jeremejevs · September 26, 2019, 5:43pm

I’m using Julia 1.2. This is my test:

a = rand(1000, 1000)
b = a'
c = copy(b)

@btime a * x setup=(x=rand(1000)) # 114.757 μs
@btime b * x setup=(x=rand(1000)) # 94.179 μs
@btime c * x setup=(x=rand(1000)) # 110.325 μs

I was expecting a and c to be at very least not slower.

After inspecting stdlib/LinearAlgebra/src/matmul.jl, it turns out that Julia passes b.parent (i.e. a) to BLAS.gemv, not b, and instead switches LAPACK’s dgemv_ into a different and apparently faster mode.

Am I correct in assuming that the speedup comes from the fact that the memory is aligned in a more favorable way for whatever dgemv_ does, when it’s in a trans = T mode? If so, then I’m guessing this isn’t actionable, besides possibly mentioning the gotcha in the docs somehow. If my assumption is wrong though, is there something to be done about this?

stevengj · September 26, 2019, 5:52pm

Close. It does have to do with memory, but it’s about locality, not alignment. The basic thing to understand is that it is more efficient to access consecutive (or at least nearby) data from memory than data that is separated, due to the existence of cache lines. (Consecutive access also has some advantages in utilizing SIMD instructions.)

Julia stores matrices in column-major order, so that the columns are contiguous in memory. When you multiply a transposed matrix (that has not been copied) by a vector, therefore, it can compute it as the dot product of the contiguous column (= transposed row) with the contiguous vector, which has good spatial locality and therefore utilizes cache lines efficiently.

For multiplying a non-transposed matrix by a vector, in contrast, you are taking the dot products of non-contiguous rows of the matrix with the vector, and it is harder to efficiently utilize cache lines. To improve spatial locality in this case, an optimized BLAS like OpenBLAS actually computes the dot products of several rows at a time (a “block”) with the vector, I believe — that’s why it’s only 10% slower and not much worse. (In fact, even the transposed case may do some blocking to keep the vector in cache.)

jeremejevs · September 27, 2019, 7:09am

Couldn’t have wished for a better reply, thank you!

Topic		Replies	Views
Question About Transposes and Matrix Multiplication Speed Numerics	9	2140	August 22, 2018
Multiplication after transpose much faster than multiplication after PermutedDimsArray Performance linearalgebra	3	1639	April 10, 2019
Improve performance of vector-matrix multiplication Performance performance , linearalgebra	2	624	November 16, 2020
Effect of Column Major Storage on Multiplication Performance	3	446	November 7, 2020
Naive matrix multiplication is super slow in Julia 1.0? Performance	12	3305	August 17, 2018

Non-intuitive perf diff between `matrix * vector`, `matrix' * vector` and `copy(matrix') * vector`

Related topics