Asymmetric speed of in-place `sparse*dense` matrix product

Thanks for the comments! Nice to see this PR.

That’s exactly what I did. However, I had to add something like

    z = zero(TY)
    @inbounds for i in eachindex(Y)
      Y[i] = z
    end

because I can’t assume that `Y` is initialized with zeros (I’m applying `A_mul_B!` multiple times, so `Y` still holds the previous result).
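
For anyone reading along, here is a rough sketch of the pattern I mean, not the code from the PR. The function name `spmul_overwrite!` is made up for illustration, and I'm assuming a recent Julia where `SparseArrays` is a standard library. The loop above can also be written as `fill!(Y, zero(eltype(Y)))`, which is what the sketch uses before accumulating into `Y`:

```julia
using SparseArrays

# Hypothetical helper (not from the PR): overwrite Y with A * X,
# where A is a CSC sparse matrix and X, Y are dense matrices.
function spmul_overwrite!(Y::AbstractMatrix, A::SparseMatrixCSC, X::AbstractMatrix)
    size(A, 2) == size(X, 1) || throw(DimensionMismatch("A and X do not match"))
    size(Y) == (size(A, 1), size(X, 2)) || throw(DimensionMismatch("Y has the wrong size"))

    # Y may hold stale values from an earlier call, so reset it first.
    fill!(Y, zero(eltype(Y)))

    rows = rowvals(A)   # row index of each stored entry
    vals = nonzeros(A)  # value of each stored entry
    @inbounds for k in 1:size(X, 2), j in 1:size(A, 2)
        xjk = X[j, k]
        # Accumulate column j of A, scaled by X[j, k], into column k of Y.
        for p in nzrange(A, j)
            Y[rows[p], k] += vals[p] * xjk
        end
    end
    return Y
end

# Usage: Y starts out full of NaN and is reused across calls.
A = sprand(5, 4, 0.5)
X = rand(4, 3)
Y = fill(NaN, 5, 3)
spmul_overwrite!(Y, A, X)
spmul_overwrite!(Y, A, X)  # second call gives the same result, no leftover sums
```

Without the `fill!` the second call would add on top of the first result, which is exactly the problem I ran into.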