I assume you’re thinking along the lines of how the accumulating statistics functions on HP calculators worked, with a `Σ+` key and a `Σ-` key, where those would add or remove `x,y` sample pairs, not by keeping a list but by appropriately updating the stats registers of `Σx`, `Σy`, etc.

As suspected: https://github.com/JeffreySarnoff/RollingFunctions.jl/blob/f42c3e28668d63191daf93e4b62f81d9f98c1b40/src/roll/rolling.jl#L6-L16

So… make a pull request with your suggested optimization? That’s the beauty of open source: you can lean on others’ work and collaborate to improve it.

ImageFiltering has generic operations for N-dimensional arrays. But if all you want is the sum, why not use `cumsum`?

There is a reason not to do the fast method, depending on your application.
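For concreteness, here is a minimal sketch of the fast method, i.e. the running-update `rolling_sum` assumed in the session below (the name and signature are my own, not from any package):

```julia
# Running-update rolling sum: O(length(a)) operations, but each step
# reuses the previous result, so floating-point error accumulates.
function rolling_sum(a::AbstractVector, n::Int)
    out = similar(a, length(a) - n + 1)
    out[1] = sum(a[1:n])
    for i in 2:length(out)
        out[i] = out[i-1] - a[i-1] + a[i+n-1]
    end
    return out
end
```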

```
julia> a=randn(1_000_000_000)
1000000000-element Array{Float64,1}:
-1.2344126204756887
-0.8239991893593851
1.178705934119596
-0.013067916777890981
-2.0283561056107553
0.2622616832126358
-0.6759869411454665
-0.5361857977094139
0.8097520205869354
0.32388073554507785
-2.527704302672418
⋮
-0.6576524013369835
-0.23953427627419813
-0.028670388394535708
-1.5805848784975132
0.48150455056217334
1.5563777415762128
-2.8107725516409716
-1.0499675458464977
-0.27204237866643
-1.3917792177726476
julia> rolling_sum(a, 20)[end]
-5.901424969621745
julia> sum(a[end-19:end])
-5.901424969620317
```

You *probably* don’t care about the last 4 digits being different due to accumulated error, although you could reset the error periodically if it was an issue.

You could also try to make it faster using `@view` and parallelising. For example via:

```
using Distributed
addprocs(2) # you choose

function rolling_sum(X::Array{Float64,1}, window::Int64)
    # Compute the rolling sum as an elementwise (+) reduction
    # of `window` shifted views of X
    Y = @sync @distributed (+) for i = 1:window
        @view(X[i:end-(window-i)])
    end
    return Y
end
```

I think you might see better results for large windows and more complicated functions to roll. I do not know the `RollingFunctions` package very well. If there is room for improvement I might contribute a commit.

See the following conversation; there are also some code examples in the comments. In particular, this comment:

I don’t think the rolling_sum function is amenable to parallel processing or multiple threads until the number of cores exceeds the length of the window, and then some. The `out[i] = out[i-1]-a[i-1]+a[i+n-1]` technique is inherently sequential, and so can’t really take advantage of SIMD. The other technique might be able to use SIMD, but because its operation count grows with length(data)*n rather than length(data), there’s a lot of inefficiency to be made up for.

EDIT: I take it back, but the problem would need to be split lengthwise, not at a deep loop level. So every 10,000 or so you’d reinitialize the sum and launch another sequentially computing task.
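A rough sketch of that lengthwise split (the chunk size and names are mine, just to illustrate the idea):

```julia
# Split the output into chunks; each chunk re-initializes its own running
# sum, so the chunks are independent and can run on separate threads.
function chunked_rolling_sum(a::AbstractVector, n::Int; chunk::Int=10_000)
    out = similar(a, length(a) - n + 1)
    Threads.@threads for s in collect(1:chunk:length(out))
        e = min(s + chunk - 1, length(out))
        acc = sum(@view a[s:s+n-1])   # fresh sum at each chunk boundary
        out[s] = acc
        for i in s+1:e
            acc += a[i+n-1] - a[i-1]  # sequential update within the chunk
            out[i] = acc
        end
    end
    return out
end
```

Re-initializing at each boundary also limits how far the accumulated roundoff can drift.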

I wanted a rolling sum, not a cumulative sum.

Yes, but you can subtract: since the cumulative sum is `c_k = a_1 + a_2 + … + a_k`, a rolling sum of the previous `n` entries is just `c_k - c_{k-n}`.
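In code, that subtraction looks like the following (the helper name is mine):

```julia
# Rolling sum of window n from a single cumulative-sum pass:
# out[k] = c[k+n-1] - c[k-1], with c[0] treated as zero.
function rolling_sum_from_cumsum(a::AbstractVector, n::Int)
    c = cumsum(a)
    return c[n:end] .- vcat(zero(eltype(c)), c[1:end-n])
end
```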

I see. Ok, maybe I’m falling into the premature optimization rabbit hole.

To clarify, there probably isn’t a faster way: for a vector of length `L`, the cumulative sum can be computed in `L` operations. Therefore a rolling sum of length `n`, for arbitrary `n`, can be computed in `2L` operations. A naive rolling sum of length `n` is `n*L` operations, which for `n>2` is slower than using `cumsum`. For very large rolling sums, the difference will be dramatic.

This focus on operation count ignores roundoff errors, which ~~are indeed much~~ may be worse for the `cumsum` approach. If that’s important to you, then @klaff’s Rolling Sum is a better approach. That one is also `2L` operations, so there’s no real reason not to use it.

Out of curiosity, why do you say that the cumsum approach is worse than the

```
out[i] = out[i-1]-a[i-1]+a[i+n-1]
```

technique proposed in @klaff’s approach?

Admittedly I haven’t given much thought to this question, but I don’t see any obvious reason why one approach would be better than the other. And in some specific cases (for example, when dealing with positive numbers), my intuition would even lead me to favor the cumsum approach (again, just intuition; I haven’t performed a serious error analysis…)

Wouldn’t an approach around Base.cumsum need to allocate arrays for the two intermediate sums? Maybe I’m not thinking correctly about how that would work.

No, there is only one array of cumulative sums. The two intermediate partial sums (that are subtracted from one another in the end) are two elements of the same array.

There could be a second array if you wanted to also store the rolling sums (instead of computing them on the fly, when the window size is known, using a single subtraction operation).

Got it, thanks. Caffeine still being absorbed this morning.

I haven’t done a serious analysis either. For the purposes of discussion, let’s consider a list of numbers that are uniformly distributed between 0 and 1. The magnitude of the cumsum grows linearly with `i`, and consequently once you’ve summed `2/eps()` of them together you’ll stop incrementing altogether due to a complete loss of precision. In contrast

```
out[i] = out[i-1]-a[i-1]+a[i+n-1]
```

intuitively seems like it should accumulate error as `sqrt(i)`.
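A quick (unscientific) way to see that drift: compare the running-update result for the last window against a `BigFloat` reference. The setup below (array size, window, seed) is arbitrary.

```julia
using Random
Random.seed!(0)

a = rand(100_000)   # uniform in [0, 1)
n = 20

# Running-update rolling sum, as above
out = similar(a, length(a) - n + 1)
out[1] = sum(a[1:n])
for i in 2:length(out)
    out[i] = out[i-1] - a[i-1] + a[i+n-1]
end

# High-precision reference for the final window
exact = Float64(sum(BigFloat.(a[end-n+1:end])))
println(abs(out[end] - exact))   # the accumulated drift
```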

The threaded version I wrote (trolling_sum) recomputes the sum at the beginning of each thread’s segment (necessary so the threads can be independent), which limits the accumulated error.

On the other hand, this formula is prone to cancellations (where the relative error is unbounded), whereas in the cumsum of positive numbers, the relative error at each operation is (as you noted) bounded by eps().

But I’ll try to avoid derailing this thread any more, and perform a real error analysis instead of waving my hands.

I just want to pop in here and say that the beauty of RollingFunctions.jl is mostly in its ease. I’ve used it many times to create rolling versions of my own functions. It strikes the balance between general functionality and performance quite well. It already uses `views` and very tight loops.

If you are doing something simple like a sum, of course a specialized method will be faster.