Optimizing sums of products (dot products)

If you just put @simd in front of the loop in sumproduct_two, it matches or exceeds the speed of dot on my machine. For such a short length (100), it is actually faster than dot, because the BLAS call in dot imposes some additional overhead.
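For concreteness, here is a minimal sketch of what such a loop might look like. The definition of sumproduct_two is not shown in this post, so sumproduct_simd below is a hypothetical stand-in, and I am assuming two floating-point vectors with the same indices:

```julia
using LinearAlgebra  # for dot, used only as a reference result

# Hypothetical stand-in for sumproduct_two with @simd added.
# Assumes x and y share the same indices (eachindex throws otherwise).
function sumproduct_simd(x::AbstractVector, y::AbstractVector)
    s = zero(promote_type(eltype(x), eltype(y)))
    # @simd permits the compiler to re-order the additions so it can vectorize;
    # @inbounds skips bounds checks, which eachindex(x, y) makes safe here.
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end

x = rand(100)
y = rand(100)
# The two results should agree up to floating-point roundoff.
agrees = sumproduct_simd(x, y) ≈ dot(x, y)
```

If you want to compare timings yourself, a benchmarking package is the usual route; results will depend on your machine and BLAS library.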

The compiler can’t apply @simd automatically for this sum because re-ordering the additions changes the answer slightly (floating-point addition is not associative).
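A tiny illustration of why re-ordering matters, using three Float64 values chosen only for demonstration:

```julia
# Floating-point addition is not associative: grouping changes the rounding.
left  = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# The two groupings differ in the last bit, so this is false.
same = left == right
```

The difference is at the level of roundoff, but the compiler is not allowed to introduce it unless you opt in, which is what @simd does.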

For long vectors, the sum version is almost certainly more accurate, because it uses pairwise summation whereas the BLAS call in dot probably does a “naive” sum. This won’t affect you for length-100 vectors, however, since the pairwise algorithm only turns on for length > 1024 (for performance reasons).
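To show the idea, here is a rough sketch of pairwise summation applied to a sum of products. This is not the code Base uses; pairwise_sumproduct and naive_sumproduct are hypothetical names, the blocksize of 1024 just mirrors the threshold mentioned above, and I am assuming both vectors share the same indices:

```julia
# Plain left-to-right accumulation, for comparison.
function naive_sumproduct(x, y)
    s = zero(promote_type(eltype(x), eltype(y)))
    for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end

# Pairwise version: split the range in half recursively and add the two
# partial sums; fall back to a plain loop once a block is small enough.
function pairwise_sumproduct(x, y, lo=firstindex(x), hi=lastindex(x); blocksize=1024)
    if hi - lo < blocksize
        s = zero(promote_type(eltype(x), eltype(y)))
        for i in lo:hi
            s += x[i] * y[i]
        end
        return s
    end
    mid = (lo + hi) >>> 1
    return pairwise_sumproduct(x, y, lo, mid; blocksize=blocksize) +
           pairwise_sumproduct(x, y, mid + 1, hi; blocksize=blocksize)
end

# Float32 data makes the roundoff growth of the naive loop easy to see.
x = fill(0.1f0, 10^6)
y = ones(Float32, 10^6)
exact = Float64(0.1f0) * 10^6   # reference value computed in Float64

err_naive    = abs(naive_sumproduct(x, y) - exact)
err_pairwise = abs(pairwise_sumproduct(x, y) - exact)
```

On inputs like this, err_pairwise should come out much smaller than err_naive, since the error of pairwise summation grows far more slowly with length than that of a naive loop.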

However, note that you are asking about a different function than the OP in this old thread (sum rather than cumsum), so I will split this off into a new thread. In general, please prefer starting new threads (cross-referencing or linking older threads as appropriate) over reviving ancient ones.
