If you just put `@simd` in front of the loop in `sumproduct_two`, then it matches or exceeds the speed of `dot` on my machine. For such a small length (100), it is actually faster than `dot` because the BLAS call in `dot` imposes some additional overhead.
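For reference, here is a minimal sketch of what the annotated loop might look like; the exact definition of `sumproduct_two` comes from the earlier thread, so the signature and loop body below are assumptions for illustration:

```julia
using LinearAlgebra: dot

# Assumed shape of sumproduct_two, with @simd added in front of the loop.
# @inbounds is also used so bounds checks don't block vectorization.
function sumproduct_two_simd(x, y)
    s = zero(promote_type(eltype(x), eltype(y)))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end

x, y = rand(100), rand(100)
sumproduct_two_simd(x, y) ≈ dot(x, y)  # true, up to floating-point rounding
```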
Compilers can’t do `@simd` automatically for this sum because it changes the answers slightly (by re-ordering the additions).
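Floating-point addition is not associative, which is why the compiler isn’t allowed to re-order it on its own. For example:

```julia
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)  # false: the two orderings give slightly different results
```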
For long vectors, the `sum` version is almost certainly more accurate, because it uses pairwise summation whereas the BLAS call in `dot` probably does a “naive” sum. This won’t affect you for length-100 vectors, however, since the pairwise algorithm only turns on for length > 1024 (for performance reasons).
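If you want to see the accuracy difference for yourself, a quick comparison against a higher-precision reference looks like this (the vector and its length are illustrative, not from the original benchmark):

```julia
x = rand(Float32, 10^7)             # illustrative long vector

exact    = sum(Float64.(x))         # higher-precision reference
pairwise = sum(x)                   # Base's sum uses pairwise summation for long vectors
naive    = foldl(+, x)              # strict left-to-right ("naive") accumulation

abs(pairwise - exact), abs(naive - exact)  # the naive error is typically much larger
```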
However, note that you are talking about a different function than the OP in this old thread (`sum` rather than `cumsum`), so I will split it off into a new thread. In general, please be more reluctant to revive ancient threads rather than starting new ones (and cross-referencing/linking older threads as appropriate).