I’m using Julia 1.2. This is my test:
```julia
using BenchmarkTools

a = rand(1000, 1000)
b = a'
c = copy(b)

@btime a * x setup=(x=rand(1000))  # 114.757 μs
@btime b * x setup=(x=rand(1000))  # 94.179 μs
@btime c * x setup=(x=rand(1000))  # 110.325 μs
```
I was expecting `a` and `c` to be at least as fast as `b`, not slower.
After inspecting stdlib/LinearAlgebra/src/matmul.jl, I found that Julia passes b.parent (i.e. a) to BLAS.gemv, not b, and instead switches BLAS's dgemv_ into its trans = 'T' mode, which is apparently faster.
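If I'm reading that dispatch correctly, the same difference should show up when calling BLAS.gemv directly, bypassing the Adjoint wrapper entirely. A quick check (reusing the same 1000×1000 matrix as above):

```julia
using LinearAlgebra, BenchmarkTools

a = rand(1000, 1000)
x = rand(1000)

# What a * x lowers to: dgemv_ in 'N' (no-transpose) mode.
@btime BLAS.gemv('N', $a, $x)

# What b * x (with b = a') lowers to: dgemv_ in 'T' mode on the same memory.
@btime BLAS.gemv('T', $a, $x)
```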
Am I correct in assuming that the speedup comes from the memory access pattern being more favorable for whatever dgemv_ does when it's in trans = 'T' mode? If so, then I'm guessing this isn't actionable, beyond possibly mentioning the gotcha in the docs somewhere. If my assumption is wrong, though, is there something to be done about this?
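To make the question concrete, here is how I picture the two traversal orders over column-major storage. This is only a naive sketch (the function names are mine), not what the actual blocked and vectorized BLAS kernels do:

```julia
# 'N' mode, y = A * x: an axpy per column; y is re-read and re-written
# for every column of A.
function naive_gemv_n(A, x)
    m, n = size(A)
    y = zeros(m)
    @inbounds for j in 1:n, i in 1:m
        y[i] += A[i, j] * x[j]
    end
    return y
end

# 'T' mode, y = A' * x: a dot product per column; each y[j] is a single
# accumulator that can stay in a register, and y is written only once.
function naive_gemv_t(A, x)
    m, n = size(A)
    y = zeros(n)
    @inbounds for j in 1:n
        s = 0.0
        for i in 1:m
            s += A[i, j] * x[i]
        end
        y[j] = s
    end
    return y
end
```

Both versions walk a's memory column by column, so my (possibly wrong) intuition is that the difference is in how the output is accumulated rather than in alignment per se.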