To clarify, there probably isn’t a faster way: for a vector of length L, the cumulative sum can be computed in L operations. Therefore a rolling sum of length n, for arbitrary n, can be computed in 2L operations. A naive rolling sum of length n is n*L operations, which for n>2 is slower than using cumsum. For very large rolling sums, the difference will be dramatic.
This focus on operations count ignores roundoff errors, which are indeed muchmay be worse for the cumsum approach. If that’s important to you, then @klaff’s Rolling Sum - #4 by klaff is a better approach. That one is also 2L, so there’s no real reason not to use it.
Admittedly I haven’t given much thought to this question, but I don’t see any obvious reason why one approach would be better than the other. And in some specific cases (for example when dealing with positive numbers), my intuition would even lead me to favor the cumsum approach (again, just intuition, I haven’t performed a serious error analysis…)
Wouldn’t an approach around Base.cumsum need to allocate arrays for the two intermediate sums? Maybe I’m not thinking correctly about how that would work.
No, there is only one array of cumulative sums. The two intermediate partial sums (that are subtracted from one another in the end) are two elements of the same array.
There could be a second array if you wanted to also store the rolling sums (instead of computing them on the fly, when the window size is known, using a single subtraction operation).
I haven’t done a serious analysis either. For the purposes of discussion let’s consider a list of numbers that are uniformly distributed between 0 and 1. The magnitude of the cumsum grows linearly with i, and consequently once you’ve summed 2/eps() of them together you’ll stop incrementing altogether due to a complete loss of precision. In contrast
out[i] = out[i-1]-a[i-1]+a[i+n-1]
intuitively seems like it should accumulate error as sqrt(i).
The threaded version I wrote (trolling_sum) recomputes the sum at the beginning of each thread’s segment (necessary so the threads can be independent), which limits the accumulated error.
On the other hand, this formula is prone to cancellations (where the relative error is unbounded), whereas in the cumsum of positive numbers, the relative error at each operation is (as you noted) bounded by eps().
But I’ll try to avoid derailing this thread any more, and perform a real error analysis instead of waving my hands
I just want to pop in here and say that the beauty of RollingFunctions.jl is mostly in its ease. I’ve used it many times to create rolling versions of my own functions. It strikes the balance between general functionality and performance quite well. It already uses views and very tight loops.
If you are doing something simple like a sum, of course a specialized method will be faster.