Getting non-vectorized code to the speed of vectorized code

@DNF, assuming this is correct, perhaps your ideas on this subject would merit a place in the performance tips section?

It’s not really an idea, as such. There are people on this board who know about this. I’m just reporting my understanding of things I’ve read.

In any case, I think this is a micro-optimization, perhaps too marginal for the performance tips.

I tested again with @benchmark, and the median speedups were systematically on the order of 20%. That is not a micro saving, and the idea behind it, if correct, should be highlighted.