# Rolling Sum

I assume you’re thinking along the lines of how the accumulating statistics functions on HP calculators worked, with a Σ+ key and a Σ- key, where those would add or remove x,y sample pairs, not by keeping a list but by appropriately updating the stats registers of Σx, Σy, etc.

1 Like

So… make a pull-request with your suggested optimization? That’s the beauty of open-source — you can lean on others work and collaborate to improve it.

7 Likes

ImageFiltering has generic operations for N-dimensional arrays. But if all you want is the sum, why not use cumsum?

3 Likes

There is a reason not to do the fast method, depending on your application.

julia> a=randn(1_000_000_000)
1000000000-element Array{Float64,1}:
-1.2344126204756887
-0.8239991893593851
1.178705934119596
-0.013067916777890981
-2.0283561056107553
0.2622616832126358
-0.6759869411454665
-0.5361857977094139
0.8097520205869354
0.32388073554507785
-2.527704302672418
⋮
-0.6576524013369835
-0.23953427627419813
-0.028670388394535708
-1.5805848784975132
0.48150455056217334
1.5563777415762128
-2.8107725516409716
-1.0499675458464977
-0.27204237866643
-1.3917792177726476

julia> rolling_sum(a, 20)[end]
-5.901424969621745

julia> sum(a[end-19:end])
-5.901424969620317


You probably don’t care about the last 4 digits being different due to accumulated error, although you could reset the error periodically if it was an issue.

3 Likes

You could also try to make it faster using @view and parallelising. For example via:

using Distributed;

function rolling_sum(X::Array{Float64,1}, window::Int64)

# Compute rolling sum
Y = @sync @distributed (+) for i=1:window
@view(X[i:end-(window-i)])
end

# Return rolling sum
return Y;
end


I think you might see better results for large windows and more complicated functions to roll. I do not know the RollingFunctions package very well. If there is some space I might do a commit.

See the following conversation, there are also some code examples in comments. In particular, this comment:

I don’t think the rolling_sum function is amenable to parallel processing or multiple threads until the number of cores exceeds the length of the window, and then some. The out[i] = out[i-1]-a[i-1]+a[i+n-1] technique is inherently sequential, and so can’t really take advantage of SIMD. The other technique might be able to use SIMD but because it’s operations grow with length(data)*n rather than length(data), there’s a lot of inefficiency to be made up for.

EDIT: I take it back, but the problem would need to be split lengthwise, not at a deep loop level. So every 10,000 or so you’d reinitialize the sum and launch another sequentially computing task.

I wanted a rolling sum, not a cumulative sum.

Yes, but you can subtract: since

c_k = \sum_{i = 1}^k x_i

then a rolling sum of the previous n entries is just c_k - c_{k-n}.

6 Likes

I see. Ok, maybe I’m falling into the premature optimization rabbit hole.

1 Like

To clarify, there probably isn’t a faster way: for a vector of length L, the cumulative sum can be computed in L operations. Therefore a rolling sum of length n, for arbitrary n, can be computed in 2L operations. A naive rolling sum of length n is n*L operations, which for n>2 is slower than using cumsum. For very large rolling sums, the difference will be dramatic.

This focus on operations count ignores roundoff errors, which are indeed muchmay be worse for the cumsum approach. If that’s important to you, then @klaff’s Rolling Sum is a better approach. That one is also 2L, so there’s no real reason not to use it.

3 Likes

Out of curiosity, why do you say that the cumsum approach is worse than the

    out[i] = out[i-1]-a[i-1]+a[i+n-1]


technique proposed in @klaff’s approach?

Admittedly I haven’t given much thought to this question, but I don’t see any obvious reason why one approach would be better than the other. And in some specific cases (for example when dealing with positive numbers), my intuition would even lead me to favor the cumsum approach (again, just intuition, I haven’t performed a serious error analysis…)

Wouldn’t an approach around Base.cumsum need to allocate arrays for the two intermediate sums? Maybe I’m not thinking correctly about how that would work.

No, there is only one array of cumulative sums. The two intermediate partial sums (that are subtracted from one another in the end) are two elements of the same array.

There could be a second array if you wanted to also store the rolling sums (instead of computing them on the fly, when the window size is known, using a single subtraction operation).

2 Likes

Got it, thanks. Caffeine still being absorbed this morning.

I haven’t done a serious analysis either. For the purposes of discussion let’s consider a list of numbers that are uniformly distributed between 0 and 1. The magnitude of the cumsum grows linearly with i, and consequently once you’ve summed 2/eps() of them together you’ll stop incrementing altogether due to a complete loss of precision. In contrast

out[i] = out[i-1]-a[i-1]+a[i+n-1]


intuitively seems like it should accumulate error as sqrt(i).

The threaded version I wrote (trolling_sum) recomputes the sum at the beginning of each thread’s segment (necessary so the threads can be independent), which limits the accumulated error.

On the other hand, this formula is prone to cancellations (where the relative error is unbounded), whereas in the cumsum of positive numbers, the relative error at each operation is (as you noted) bounded by eps().

But I’ll try to avoid derailing this thread any more, and perform a real error analysis instead of waving my hands 1 Like

I just want to pop in here and say that the beauty of RollingFunctions.jl is mostly in its ease. I’ve used it many times to create rolling versions of my own functions. It strikes the balance between general functionality and performance quite well. It already uses views and very tight loops.

If you are doing something simple like a sum, of course a specialized method will be faster.

3 Likes