Yeah, but then the result of each of those multiplications goes into a separate entry of the output vector, instead of each column-vector product accumulating into a single element of the output vector.
But no matter, I don’t really know how BLAS optimizes this stuff. I have just observed that At_mul_B! is significantly faster, and that it seems easier to optimize in a naive way.
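To make the difference concrete, here is a rough sketch of the two naive loops I have in mind, for a column-major real matrix. The function names `naive_mul!` and `naive_tmul!` are made up for illustration, and this is not a claim about what BLAS does internally.

```julia
# Naive column-major kernels. Names are hypothetical, for illustration only.

# y = A * x: column j is scaled by x[j] and added across all of y.
function naive_mul!(y, A, x)
    fill!(y, zero(eltype(y)))
    for j in axes(A, 2)
        xj = x[j]
        for i in axes(A, 1)
            y[i] += A[i, j] * xj   # each product lands in a different y[i]
        end
    end
    return y
end

# y = transpose(A) * x: column j is dotted with x and reduces to y[j].
function naive_tmul!(y, A, x)
    for j in axes(A, 2)
        acc = zero(eltype(y))
        for i in axes(A, 1)
            acc += A[i, j] * x[i]  # every product accumulates into one scalar
        end
        y[j] = acc
    end
    return y
end
```

Both versions walk down the columns of A in memory order. The difference is that the first one reads and writes all of y for every column, while the second keeps its running sum in a single scalar and writes each output element once.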