Numerically stable "cumulative variance" function in O(n)

I’m posting this in off-topic because my question is about algorithmic complexity and not about Julia performance.

I’m trying to write a function that computes the cumulative variance of a vector x, that is, the vector y such that y[i] is the sample variance of x[1:i]. For example, this implementation works as desired:

using StatsBase

# Numerically stable but O(n^2)
function cumvar_easy(x)
    return [var(x[begin:i], corrected=false) for i in eachindex(x)]
end

But this is effectively a nested loop, since each var call scans a whole prefix of x, giving O(n^2) complexity. By applying the identity E[(X - E[X])^2] = E[X^2] - (E[X])^2, we can reduce the complexity to O(n) as follows:

# O(n) but vulnerable to catastrophic cancellation
function cumvar_fromcumsums(x)
    # as[i] = mean(x[1:i]) ^ 2
    as = (cumsum(x) ./ (1:length(x))) .^ 2
    
    # bs[i] = mean(x[1:i] .^ 2)
    bs = cumsum(x .^ 2) ./ (1:length(x))

    return bs - as
end

This function also works, but it is vulnerable to catastrophic cancellation, i.e. numerical instability.
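For instance, here is a quick sketch using the two definitions above (the offset 1e8 is just an arbitrary large shift): adding a constant leaves the true variances unchanged, but it makes as and bs huge and nearly equal, so their difference loses its significant digits.

# Shifting the data leaves the variances unchanged, but bs[i] and as[i]
# become roughly 1e16, where the Float64 spacing is about 2.0, so
# differences on the order of 1 are unreliable.
x = 1e8 .+ [1.0, 2.0, 3.0, 4.0]

cumvar_easy(x)         # ≈ [0.0, 0.25, 0.667, 1.25] (correct)
cumvar_fromcumsums(x)  # typically far off: zero or even negative values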

Is there a numerically stable way to compute this function in linear time?

Note: I realize that the functions above make tons of allocations; I omitted the usual mitigations for simplicity’s sake. This question is about order of complexity, not Julia performance.

Numerical errors arise when adding numbers of very different scales. This happens in the cumsum calculation, where each single element is added to the cumulative sum of many previous ones. The following mitigates this by maintaining the normalized cumulative sum (i.e. the running mean) at each point, instead of dividing by 1:length(x) after the fact. It is still O(n). Can you say whether it solves the problem?

# O(n), less vulnerable to catastrophic cancellation
function cumvar_fromcumsums2(x)
    # as[i] = mean(x[1:i])^2, via a running mean
    as = similar(x)
    cur = 0.0
    for i in 1:length(x)
        cur = cur * ((i - 1) / i) + x[i] * (1 / i)
        as[i] = cur^2
    end

    # bs[i] = mean(x[1:i] .^ 2), via a running mean
    bs = similar(x)
    cur = 0.0
    for i in 1:length(x)
        cur = cur * ((i - 1) / i) + x[i]^2 * (1 / i)
        bs[i] = cur
    end

    return bs - as
end

Good point: my use of cumsum is another source of numerical instability. But (if I am understanding this code correctly) I think this code still ultimately relies on the computation mean(X^2) - mean(X)^2, and that subtraction is numerically unstable even if the two terms are each calculated to a high degree of accuracy. (For example, if the first term is 25000000000000000000000002 and the second is 25000000000000000000000001, then, when these numbers are rounded to floats, the difference may wash out to zero.)
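You can see this directly in Julia (a small sketch of the example above, using BigInt to make the exact values explicit):

# Both terms are known exactly as BigInts, but after rounding to Float64
# the spacing between adjacent floats at this magnitude (about 4.3e9)
# dwarfs the true difference of 1, so the subtraction washes out to zero.
a = big(25000000000000000000000002)
b = big(25000000000000000000000001)

a - b                    # 1 (exact)
Float64(a) - Float64(b)  # 0.0: both round to the same Float64
eps(Float64(a))          # ≈ 4.29e9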

Why not use something like Welford’s algorithm?

function cumvar(x)
    # Initialize the running sum of squared deviations. The `/ 1` promotes
    # integer element types to float, and the `^2` gives the right (squared)
    # units if x is dimensionful.
    sumsqr = (prevmean = zero(eltype(x)) / 1)^2
    v = Vector{typeof(sumsqr)}(undef, length(x))
    for (n, x) in enumerate(x)
        mean = (prevmean*(n-1) + x) / n        # running mean of x[1:n]
        sumsqr += (x - prevmean) * (x - mean)  # Welford update of the sum of squared deviations
        v[n] = sumsqr / n                      # uncorrected variance of x[1:n]
        prevmean = mean
    end
    return v
end
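For example, a quick sanity check against the O(n^2) reference from the original post (my own test, with an arbitrary 1e8 offset to stress stability):

using StatsBase  # for var, used by cumvar_easy above

x = 1e8 .+ randn(1000)   # large offset stresses numerical stability
y1 = cumvar(x)           # Welford, O(n)
y2 = cumvar_easy(x)      # stable reference, O(n^2)
maximum(abs.(y1 .- y2))  # should be tiny, near machine precision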

Thank you, I just discovered this on Wikipedia too:

Algorithms for calculating variance - Wikipedia (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)

There is a nice extension (Chan’s algorithm) that combines the means and variances of two samples, which is actually even better for what I need. Thanks!
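For reference, a minimal sketch of that combining step, following the parallel formulas on that Wikipedia page (the function name combine and the (count, mean, M2) representation are my own choices):

# Merge the summaries of two disjoint samples A and B, where each summary
# is (n, mean, M2) with M2 = sum((x .- mean(x)).^2).
function combine((n_a, mean_a, M2_a), (n_b, mean_b, M2_b))
    n = n_a + n_b
    δ = mean_b - mean_a
    return (n,
            mean_a + δ * n_b / n,               # combined mean
            M2_a + M2_b + δ^2 * n_a * n_b / n)  # combined M2
end

# The uncorrected variance of the merged sample is then M2 / n.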

Note, by the way, that this won’t work if e.g. x is an array of Int, since you won’t be able to store the resulting floating-point values in similar(x). bs = similar(x) has an additional problem if x is dimensionful, because the units of bs should then be the square of the units of x, whereas similar(x) has the units of x.

This is why I used a more complicated initialization procedure in my cumvar function. Of course, when writing casual code for your own use, one often doesn’t worry about such things, but it’s good to get in the habit of thinking about the output types as computed from the input types.
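To spell out what that initialization does (my own step-by-step illustration):

x = [1, 2, 3]        # eltype Int
z = zero(eltype(x))  # 0   :: Int
m = z / 1            # 0.0 :: Float64, since division promotes to the type of a mean
s = m^2              # 0.0 :: Float64, squared, so a dimensionful x (e.g. Unitful
                     # meters) would give s units of meters squared
typeof(s)            # Float64; cumvar then allocates a Vector{Float64}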
