# Numerically stable "cumulative variance" function in O(n)

I’m posting this in off-topic because my question is about algorithmic complexity and not about Julia performance.

I’m trying to write a function that computes the cumulative variance of a vector `x`, that is, the vector `y` such that `y[i]` is the sample variance of `x[1:i]`. For example, this implementation works as desired:

``````julia
using StatsBase

# Numerically stable but O(n^2)
function cumvar_easy(x)
    return [var(x[begin:i], corrected=false) for i in eachindex(x)]
end
``````

But this is a nested loop with O(n^2) complexity. By applying the identity `E[(X - E(X))^2] = E(X^2) - E(X)^2`, we can reduce the complexity to O(n) as follows:

``````julia
# O(n) but vulnerable to catastrophic cancellation
function cumvar_fromcumsums(x)
    # as[i] = mean(x[1:i]) ^ 2
    as = (cumsum(x) ./ (1:length(x))) .^ 2

    # bs[i] = mean(x[1:i] .^ 2)
    bs = cumsum(x .^ 2) ./ (1:length(x))

    return bs - as
end
``````

This function also works, but it is vulnerable to catastrophic cancellation, i.e. numerical instability.

Is there a numerically stable way to compute this function in linear time?

Note: I realize that the functions above make tons of allocations; I omitted the usual mitigations for simplicity’s sake. This question is about order of complexity, not Julia performance.

Numerical errors arise when adding numbers of very different scales. This happens in the `cumsum` calculation when a single element is added to the running sum of many previous ones. The following mitigates this by maintaining a normalized (running-mean) cumsum at each step, instead of dividing by `1:length(x)` after the fact. It is still O(n). Does it solve the problem?

``````julia
# O(n), less vulnerable to catastrophic cancellation
function cumvar_fromcumsums2(x)
    as = similar(x)
    cur = 0.0
    for i = 1:length(x)
        cur = cur * ((i - 1) / i) + x[i] * (1 / i)
        as[i] = cur^2
    end

    bs = similar(x)
    cur = 0.0
    for i = 1:length(x)
        cur = cur * ((i - 1) / i) + (x[i]^2) * (1 / i)
        bs[i] = cur
    end

    return bs - as
end
``````

Good point: my use of `cumsum` is another source of numerical instability. But (if I am understanding this code correctly) it still ultimately relies on the computation `mean(X^2) - mean(X)^2`, and that subtraction is numerically unstable even when both terms are computed to a high degree of accuracy. (For example, if the first term is `25000000000000000000000002` and the second is `25000000000000000000000001`, then, once these numbers are rounded to floats, the difference can wash out to zero.)
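The failure mode is easy to reproduce. Here is a quick sketch (in Python, just for illustration; the same thing happens with Julia's `Float64`) where the naive mean-of-squares-minus-square-of-mean formula returns exactly zero even though the true variance is 0.25:

```python
# Two values near 1e8 whose true (uncorrected) variance is 0.25.
xs = [1e8, 1e8 + 1]
n = len(xs)

# Naive formula: mean of squares minus square of mean.
mean_of_squares = sum(v * v for v in xs) / n
square_of_mean = (sum(xs) / n) ** 2
naive_var = mean_of_squares - square_of_mean  # cancels to exactly 0.0

# Two-pass formula: subtract the mean first, then square.
m = sum(xs) / n
stable_var = sum((v - m) ** 2 for v in xs) / n  # 0.25
```

Both terms of the naive formula are about `1e16`, where the spacing between adjacent doubles is 2, so the `0.25` of true variance is lost entirely in the rounding.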

Why not use something like Welford’s algorithm?

``````julia
function cumvar(x)
    sumsqr = (prevmean = zero(eltype(x)) / 1)^2
    v = Vector{typeof(sumsqr)}(undef, length(x))
    for (n, x) in enumerate(x)
        mean = (prevmean * (n - 1) + x) / n
        sumsqr += (x - prevmean) * (x - mean)
        v[n] = sumsqr / n
        prevmean = mean
    end
    return v
end
``````

Thank you, I just discovered this on Wikipedia too:

Algorithms for calculating variance - Wikipedia

There is a nice extension (Chan’s algorithm) that combines the means and variances of two samples, which is actually even better for what I need. Thanks!
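For reference, the pairwise combination rule from that Wikipedia page can be sketched as follows (in Python for illustration; `combine` is my own name for the merge step). Each sample is summarized by `(n, mean, M2)`, where `M2` is the sum of squared deviations from that sample's own mean:

```python
def combine(a, b):
    """Merge two (n, mean, M2) summaries into one for the pooled sample."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term accounts for the shift between the two sample means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)

# Summaries of [1, 2] and [3, 4]:
a = (2, 1.5, 0.5)
b = (2, 3.5, 0.5)
n, mean, m2 = combine(a, b)
uncorrected_var = m2 / n  # variance of [1, 2, 3, 4]
```

Because `combine` is associative enough in practice, this is what makes parallel and chunked variance computations work.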

Note, by the way, that this won’t work if e.g. `x` is an array of `Int`, since you won’t be able to store the resulting floating-point values. `bs = similar(x)` has additional problems if `x` is dimensionful, because then its units would need to be the square of the units of `x`.

This is why I used a more complicated initialization procedure in my `cumvar` function. Of course, when writing casual code for your own use, one often doesn’t worry about such things, but it’s good to get in the habit of thinking about the output types as computed from the input types.
