Efficient creation of power series matrix or array of arrays

Nice! I posted this method above btw, but in a vector format instead of a matrix.

I find this statement a bit odd/misleading: I consider cache locality crucial in algorithms like this, and not at all a micro-optimization. To see why, reverse the two for loops like this: for j = 1:m, i = 2:n, then re-run, and you’ll see the performance drop 10-fold! Same algorithm, same number of floating-point operations, but it executes 10 times slower. That’s only twice as fast as the naive solution, meaning that cache locality can matter more than picking the right algorithm. (The reason for this is explained in more detail in the link in my previous post.)
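To make the loop-order experiment concrete, here is a minimal sketch. It assumes the code under discussion fills an m×n matrix with A[j, i] = x[j]^(i-1) by repeated multiplication. That layout and the function names powers_colmajor and powers_strided are assumptions for illustration, not the original code.

```julia
# Hypothetical reconstruction: A[j, i] = x[j]^(i-1), built column by column.
# Requires n >= 1.
function powers_colmajor(x::AbstractVector{T}, n::Integer) where {T}
    m = length(x)
    A = Matrix{T}(undef, m, n)
    A[:, 1] .= one(T)
    # j (the rightmost loop variable) is the inner loop. It walks down one
    # column, which is contiguous in Julia's column-major layout.
    for i = 2:n, j = 1:m
        A[j, i] = A[j, i-1] * x[j]
    end
    return A
end

# Same arithmetic with the two loops swapped.
function powers_strided(x::AbstractVector{T}, n::Integer) where {T}
    m = length(x)
    A = Matrix{T}(undef, m, n)
    A[:, 1] .= one(T)
    # i is now the inner loop, so consecutive writes are m elements apart
    # in memory, and each iteration depends on the previous one.
    for j = 1:m, i = 2:n
        A[j, i] = A[j, i-1] * x[j]
    end
    return A
end
```

Both functions return identical matrices; only the memory access pattern differs. Timing them on a large input, for example with @btime from the BenchmarkTools package, should show the gap, though the exact factor will depend on the matrix size and the hardware.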

SIMD can also lead to enormous improvements. There was a recent topic where we played with vectorized and branchless code and were able to make some sample code several hundred times faster. The code you posted will already be SIMD-vectorized (and loop-unrolled) automatically by the compiler btw, so it’s not something you need to enable manually.

But perhaps you knew all of this already, and meant that although your code is already high-performing, you could still optimize it further.