Wow - thanks for testing this. I’m trying to understand this now. So you’re saying that this “gather” operation, when performed in every loop iteration, makes AVX essentially useless? That is, I need to figure out a representation that doesn’t require me to load basis[a, n] at each iteration, but lets me compute it on the fly instead?
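
To check that I’ve understood, here is a minimal sketch in C of the two variants as I picture them. Everything in it is an assumption for illustration: the function names, the `states` array of row indices, and especially the idea that entry (a, n) of the table is simply bit n of the integer a. My real basis may not have such a closed form.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: basis is a row-major table with n_cols columns.
   Assumption for this sketch only: entry (a, n) equals bit n of a. */

/* Variant 1: table lookup. The row index a = states[i] jumps around from
   one iteration to the next, so the loads are not contiguous and a
   vectorized version of this loop would need a gather. */
double sum_lookup(const uint8_t *basis, size_t n_cols,
                  const uint32_t *states, size_t n_states, unsigned n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n_states; ++i)
        acc += basis[(size_t)states[i] * n_cols + n];
    return acc;
}

/* Variant 2: compute the entry on the fly with a shift and a mask.
   Only states[] is read, and it is read contiguously. Requires n < 32. */
double sum_on_the_fly(const uint32_t *states, size_t n_states, unsigned n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n_states; ++i)
        acc += (double)((states[i] >> n) & 1u);
    return acc;
}
```

If that is roughly the right picture, then the second form is what I should be aiming for, provided my basis entries can actually be expressed as cheap arithmetic on the index.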