@Elrod, could you share the code for the non-allocating versions?
My assumption is that in a multithreaded scenario there will be gains in real-world cases, so it is a better representation. Maybe a different winner will emerge then?
Indeed, when one can, packing multiple mat-vec operations into a single mat-mat operation will reduce overhead. But sometimes the different vectors become available at different times, so it is not viable.
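
To make the packing idea concrete, here is a minimal sketch in plain Python (my choice for illustration, not the language of the benchmarks discussed here). The helper names `matvec`, `matmat` and `pack_columns` are hypothetical. It only shows that stacking the vectors as columns and doing one mat-mat product gives the same results as separate mat-vec products; the overhead savings would come from a tuned library doing the single call, e.g. one BLAS `gemm` instead of several `gemv` calls.

```python
def matvec(A, x):
    """Return A*x for a row-major list-of-lists matrix A and a list x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def matmat(A, X):
    """Return A*X, where X is also a row-major list-of-lists matrix."""
    cols = list(zip(*X))  # columns of X
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def pack_columns(vectors):
    """Stack equal-length vectors as the columns of one matrix."""
    return [list(row) for row in zip(*vectors)]


A = [[1, 2], [3, 4]]
vectors = [[1, 0], [0, 1], [1, 1]]  # all vectors must be available up front

# One mat-vec per vector.
separate = [matvec(A, x) for x in vectors]

# One mat-mat over the packed vectors; column j of the result is A*vectors[j].
packed = matmat(A, pack_columns(vectors))
packed_columns = [list(col) for col in zip(*packed)]

assert packed_columns == separate
```

Note that `pack_columns` needs every vector before the product can start, which is exactly the limitation when the vectors arrive at different times.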