I don’t know what you are doing (see points 3 & 4 here).
Using my minimal example (perhaps with a larger vector and a larger matrix), show me that the performance suffers. Include the vector and matrix explicitly so that we have something specific to compare.