Slow sparse matrix-vector product with symmetric matrices

Yeah, but then the result of each of those multiplications goes into a separate entry in the output vector, instead of each column-vector product accumulating into a single element of the output vector.
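To make the difference concrete, here is a rough sketch of the two loops for a CSC matrix. It assumes a recent Julia with the `SparseArrays` standard library and real-valued entries. The names `csc_mul!` and `csc_tmul!` are made up for illustration, and this is not what `At_mul_B!` actually does internally, just the naive access pattern.

```julia
using SparseArrays

# y = A*x: column j scatters x[j] * A[:, j] into many different entries of y.
function csc_mul!(y, A::SparseMatrixCSC, x)
    fill!(y, zero(eltype(y)))
    rows = rowvals(A)
    vals = nonzeros(A)
    for j in 1:size(A, 2)
        xj = x[j]
        for k in nzrange(A, j)
            y[rows[k]] += vals[k] * xj   # write target changes with every k
        end
    end
    return y
end

# y = transpose(A)*x: column j of A is dotted with x and lands in y[j] only.
function csc_tmul!(y, A::SparseMatrixCSC, x)
    rows = rowvals(A)
    vals = nonzeros(A)
    for j in 1:size(A, 2)
        acc = zero(eltype(y))
        for k in nzrange(A, j)
            acc += vals[k] * x[rows[k]]  # accumulate in a local variable
        end
        y[j] = acc                       # single write per column
    end
    return y
end
```

In the second loop the inner accumulation stays in a local variable and each output entry is written once, which is probably why it is easier to make fast. For a symmetric matrix both functions should return the same vector, so the transposed version can stand in for the plain product.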

But no matter, I don’t really know how BLAS optimizes this stuff. I have just observed that At_mul_B! is significantly faster, and that it seems easier to optimize in a naive way.