Numerically stable "cumulative variance" function in O(n)

I’m posting this in off-topic because my question is about algorithmic complexity and not about Julia performance.

I’m trying to write a function that computes the cumulative variance of a vector x, that is, the vector y such that y[i] is the sample variance of x[1:i]. For example, this implementation works as desired:

using StatsBase

# Numerically stable but O(n^2)
function cumvar_easy(x)
    return [var(x[begin:i], corrected=false) for i in eachindex(x)]
end

But this is effectively a nested loop, since each var call scans a whole prefix of x, giving O(n^2) complexity. By applying the identity E[(X - E[X])^2] = E[X^2] - (E[X])^2, we can reduce the complexity to O(n) as follows:

# O(n) but vulnerable to catastrophic cancellation
function cumvar_fromcumsums(x)
    # as[i] = mean(x[1:i]) ^ 2
    as = (cumsum(x) ./ (1:length(x))) .^ 2
    
    # bs[i] = mean(x[1:i] .^ 2)
    bs = cumsum(x .^ 2) ./ (1:length(x))

    return bs - as
end

This function also works, but it is vulnerable to catastrophic cancellation, i.e. numerical instability.
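For instance, here is a quick sketch using the two definitions above (the offset 1e8 is just an arbitrary large shift): adding a constant leaves the true variances unchanged, but it makes as and bs huge and nearly equal, so their difference loses its significant digits.

# Shifting the data leaves the variances unchanged, but bs[i] and as[i]
# become roughly 1e16, where the Float64 spacing is about 2.0, so
# differences on the order of 1 are unreliable.
x = 1e8 .+ [1.0, 2.0, 3.0, 4.0]

cumvar_easy(x)         # ≈ [0.0, 0.25, 0.667, 1.25] (correct)
cumvar_fromcumsums(x)  # typically far off: zero or even negative values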

Is there a numerically stable way to compute this function in linear time?

Note: I realize that the functions above make tons of allocations; I omitted the usual mitigations for simplicity’s sake. This question is about order of complexity, not Julia performance.

Numerical errors arise when adding numbers of very different scales. This happens in the cumsum calculation, where each single element is added to the cumulative sum of many previous ones. The following mitigates this by maintaining the normalized cumulative sum (i.e. the running mean) at each point, instead of dividing by 1:length(x) after the fact. It is still O(n). Can you say whether it solves the problem?

# O(n), less vulnerable to catastrophic cancellation
function cumvar_fromcumsums2(x)
    # as[i] = mean(x[1:i])^2, via a running mean
    as = similar(x)
    cur = 0.0
    for i in 1:length(x)
        cur = cur * ((i - 1) / i) + x[i] * (1 / i)
        as[i] = cur^2
    end

    # bs[i] = mean(x[1:i] .^ 2), via a running mean
    bs = similar(x)
    cur = 0.0
    for i in 1:length(x)
        cur = cur * ((i - 1) / i) + x[i]^2 * (1 / i)
        bs[i] = cur
    end

    return bs - as
end

Good point: my use of cumsum is another source of numerical instability. But (if I am understanding this code correctly) I think this code still ultimately relies on the computation mean(X^2) - mean(X)^2, and that subtraction is numerically unstable even if the two terms are each calculated to a high degree of accuracy. (For example, if the first term is 25000000000000000000000002 and the second is 25000000000000000000000001, then, when these numbers are rounded to floats, the difference may wash out to zero.)
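You can see this directly in Julia (a small sketch of the example above, using BigInt to make the exact values explicit):

# Both terms are known exactly as BigInts, but after rounding to Float64
# the spacing between adjacent floats at this magnitude (about 4.3e9)
# dwarfs the true difference of 1, so the subtraction washes out to zero.
a = big(25000000000000000000000002)
b = big(25000000000000000000000001)

a - b                    # 1 (exact)
Float64(a) - Float64(b)  # 0.0: both round to the same Float64
eps(Float64(a))          # ≈ 4.29e9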

Why not use something like Welford’s algorithm?

function cumvar(x)
    # Initialize the running sum of squared deviations. The `/ 1` promotes
    # integer element types to float, and the `^2` gives the right (squared)
    # units if x is dimensionful.
    sumsqr = (prevmean = zero(eltype(x)) / 1)^2
    v = Vector{typeof(sumsqr)}(undef, length(x))
    for (n, x) in enumerate(x)
        mean = (prevmean*(n-1) + x) / n        # running mean of x[1:n]
        sumsqr += (x - prevmean) * (x - mean)  # Welford update of the sum of squared deviations
        v[n] = sumsqr / n                      # uncorrected variance of x[1:n]
        prevmean = mean
    end
    return v
end
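For example, a quick sanity check against the O(n^2) reference from the original post (my own test, with an arbitrary 1e8 offset to stress stability):

using StatsBase  # for var, used by cumvar_easy above

x = 1e8 .+ randn(1000)   # large offset stresses numerical stability
y1 = cumvar(x)           # Welford, O(n)
y2 = cumvar_easy(x)      # stable reference, O(n^2)
maximum(abs.(y1 .- y2))  # should be tiny, near machine precision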

Thank you, I just discovered this on Wikipedia too:

Algorithms for calculating variance - Wikipedia (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)

There is a nice extension (Chan’s algorithm) that combines the means and variances of two samples, which is actually even better for what I need. Thanks!
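For reference, a minimal sketch of that combining step, following the parallel formulas on that Wikipedia page (the function name combine and the (count, mean, M2) representation are my own choices):

# Merge the summaries of two disjoint samples A and B, where each summary
# is (n, mean, M2) with M2 = sum((x .- mean(x)).^2).
function combine((n_a, mean_a, M2_a), (n_b, mean_b, M2_b))
    n = n_a + n_b
    δ = mean_b - mean_a
    return (n,
            mean_a + δ * n_b / n,               # combined mean
            M2_a + M2_b + δ^2 * n_a * n_b / n)  # combined M2
end

# The uncorrected variance of the merged sample is then M2 / n.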

Note, by the way, that this won’t work if e.g. x is an array of Int, since you won’t be able to store the resulting floating-point values in similar(x). bs = similar(x) has an additional problem if x is dimensionful, because the units of bs should then be the square of the units of x, whereas similar(x) has the units of x.

This is why I used a more complicated initialization procedure in my cumvar function. Of course, when writing casual code for your own use, one often doesn’t worry about such things, but it’s good to get in the habit of thinking about the output types as computed from the input types.
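To spell out what that initialization does (my own step-by-step illustration):

x = [1, 2, 3]        # eltype Int
z = zero(eltype(x))  # 0   :: Int
m = z / 1            # 0.0 :: Float64, since division promotes to the type of a mean
s = m^2              # 0.0 :: Float64, squared, so a dimensionful x (e.g. Unitful
                     # meters) would give s units of meters squared
typeof(s)            # Float64; cumvar then allocates a Vector{Float64}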
