Roundoff error in variance of array of constant value

In the code below, var(M) is not 0. I guess the reason might be floating-point roundoff error, but I wonder if anyone knows how to deal with this? Thanks in advance.

M = fill(5.421241248937501521, (1,10000))
M = Float64.(M)
var(M)
var(M[:,100:10000])

It’s just floating-point roundoff error.

julia> var(M)
3.865805016084566e-29

The limited precision of Float64 is even relevant just for parsing the literal:

julia> 5.421241248937501521
5.421241248937502

So it’s no surprise that computing the mean by scaling a large sum of values, and then the variance by scaling a large sum of squared deviations from that mean, accumulates roundoff. In this case, you can reduce the number crunching by informing var that the mean is exactly the one value filling the entire matrix, so it’ll skip the sum for that part:

julia> var(M; mean=M[begin])
0.0

Now that all the deviations from the mean are exactly 0.0, they square, sum, and scale down to exactly 0.0 as well. For cases where you don’t have just one repeated value, you have to work with the limited precision in mind and round to an accepted tolerance; it’s the same reason why exact comparisons like == 0.0 are far rarer than approximate ones.
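For example, an approximate comparison with a tolerance tied to the element spacing (the choice of eps(first(M)) as the tolerance is just one reasonable convention, not a universal rule) behaves as expected here:

```julia
using Statistics

M = fill(5.421241248937501521, (1, 10000))

# An exact comparison to 0.0 is fragile, but an approximate one with a
# tolerance on the scale of the Float64 spacing near the data succeeds:
isapprox(var(M), 0.0; atol = eps(first(M)))  # true
```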

It is possible for a type to represent an array filled with 1 particular value far more efficiently, and for var to specially dispatch on said type to skip all the number crunching and return the matching type of zero. FillArrays.jl implements this:

julia> using FillArrays, Statistics

julia> M2 = Fill(5.421241248937501521, 1, 10000)
1×10000 Fill{Float64}, with entries equal to 5.421241248937502

julia> fieldnames(typeof(M2))
(:value, :axes)

julia> M2.value # only stores 1 value, not 10000 copies
5.421241248937502

julia> var(M2)
0.0

julia> @which var(M2)
var(A::FillArrays.AbstractFill{T}; corrected, mean, dims) where T<:Number
     @ FillArraysStatisticsExt C:\...\.julia\packages\FillArrays\...\ext\FillArraysStatisticsExt.jl:17
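To make the dispatch idea concrete, here is a minimal sketch of the same trick on a hand-rolled type. This is not FillArrays’ actual code; ConstFill is a hypothetical type introduced purely for illustration:

```julia
using Statistics

# A hypothetical array type that stores one value and a length, instead of
# n copies of the value.
struct ConstFill{T} <: AbstractVector{T}
    value::T
    len::Int
end
Base.size(A::ConstFill) = (A.len,)
Base.getindex(A::ConstFill, i::Int) = A.value

# Every element equals the mean, so the variance is exactly zero — no
# elementwise pass is needed at all.
Statistics.var(A::ConstFill{T}; corrected::Bool = true) where {T<:Number} =
    zero(float(T))

v = ConstFill(5.421241248937501521, 10_000)
var(v)  # 0.0, computed without touching 10000 elements
```

The point is only that dispatching on the array type lets var return the right zero in O(1); FillArrays’ real method also handles the corrected, mean, and dims keywords.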

First, here are the results in the REPL.

julia> M = fill(5.421241248937501521, (1,10000))
1×10000 Matrix{Float64}:
 5.42124  5.42124  5.42124  …  5.42124  5.42124

julia> M = Float64.(M)
1×10000 Matrix{Float64}:
 5.42124  5.42124  5.42124  …  5.42124  5.42124

julia> var(M)
1.333308260649575e-28

julia> var(M[:,100:10000])
1.5463235527558329e-28

The main issue here is the calculation of the mean.

julia> mean(M)
5.42124124893749

julia> M[1] - mean(M)
1.1546319456101628e-14

julia> eps(M[1])
8.881784197001252e-16
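To put those two numbers in perspective, the mean’s error can be measured in multiples of eps(M[1]), i.e. in units of the spacing between adjacent Float64 values near M[1]. The exact count depends on the summation order, but it is a small multiple:

```julia
using Statistics

M = fill(5.421241248937501521, (1, 10000))

# Error of the computed mean, in units of the Float64 spacing near M[1]:
err_ulps = abs(M[1] - mean(M)) / eps(M[1])  # a small multiple of eps
```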

Fortunately, if we can provide an accurate mean, then we can calculate accurate variance.

julia> var(M; mean=M[1])
0.0

This suggests an extended approach for this situation:

julia> ou_var(X) = try
           _ou_mean = only(unique(X))
           var(X; mean=_ou_mean)
       catch _
           var(X)
       end
ou_var (generic function with 1 method)

julia> ou_var(M)
0.0
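If you’d rather avoid the try/catch, the same idea can be sketched with Base.allequal (available since Julia 1.8); ae_var is just an illustrative name, not an established function:

```julia
using Statistics

# Sketch: check for a constant array up front instead of catching the
# error that only(unique(X)) throws for non-constant input.
ae_var(X) = allequal(X) ? var(X; mean = first(X)) : var(X)

ae_var(fill(5.421241248937501521, 10_000))  # 0.0
ae_var([1.0, 2.0, 3.0])                     # falls back to plain var: 1.0
```

Note that allequal still costs a pass over the data, so this mainly trades the exception-handling overhead for an explicit check.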

As already noted you have a floating-point roundoff error already in the mean computation. If you’re willing to spend more calculation time you can improve on this with a correction term:

julia> M[1]
5.421241248937502

julia> μ₁ = mean(M)
5.421241248937496

julia> μ₂ = μ₁ + mean(M .- μ₁)
5.421241248937502

and you can integrate this correction into the variance computation simply by

julia> var(M .- mean(M))
0.0
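Packaged into a helper (a sketch; corrected_var is an illustrative name), the correction pass looks like this — a first pass gets an approximate mean, a second pass averages the small residuals to refine it:

```julia
using Statistics

# Sketch: two-pass variance with a refined mean. The residuals x - μ₁ are
# tiny, so their sum can represent the error that mean(X) rounded away.
function corrected_var(X; corrected::Bool = true)
    μ₁ = mean(X)
    μ₂ = μ₁ + mean(x - μ₁ for x in X)   # correction term from the residuals
    var(X; mean = μ₂, corrected)
end

corrected_var(fill(5.421241248937501521, 10_000))  # 0.0 in this case
```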

Calculating variance accurately can take a bit of care. In particular, @GunnarFarneback’s recommendation looks pretty good (EDIT: although the builtin algorithm is already quite reliable, so unless this nonzero-for-constant-inputs result is a problem, I would just use var as-is). If you want a cheaper but marginally less accurate calculation, var(M .- first(M)) (or offsetting by any other convex combination of elements of M) is also decent (not only in this case).

It appears that Statistics uses the Welford algorithm when the mean is not provided and the target is a general iterator. For AbstractArrays, however, it uses a two-pass algorithm. Obviously, the Welford algorithm is exact in this special case:

julia> var(M)
3.865805016084566e-29

julia> var(Iterators.take(M, length(M)))
0.0
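For reference, here is a minimal sketch of the one-pass Welford update (not the stdlib’s exact code). Because it subtracts a running mean, a constant input produces exact zeros at every step:

```julia
# Minimal sketch of Welford's one-pass algorithm (sample variance).
function welford_var(itr)
    n = 0
    μ = 0.0
    m2 = 0.0                # running sum of squared deviations
    for x in itr
        n += 1
        δ = x - μ
        μ += δ / n          # update the running mean
        m2 += δ * (x - μ)   # note: uses the updated mean
    end
    m2 / (n - 1)
end

welford_var(fill(5.421241248937501521, 10_000))  # exactly 0.0
```

On a constant input, the first iteration sets μ to the value exactly, so every later δ is exactly 0.0 and m2 never grows.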

I can’t say which should be preferred in general. Clearly, someone made the deliberate choice to use a two-pass algorithm for AbstractArrays. Given the two-pass algorithm’s cost advantage, I must imagine it is slightly preferable in most situations.

Note that the built-in algorithms take this care, and are numerically stable.

In the example above, the backwards error is extremely small already. It’s not clear to me in what context this error is a practical problem, or if it’s just one of those cases where people are surprised by the existence of roundoff error.

There are ways to get exactly rounded results, of course, but they are only worth it in rare circumstances.
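One such way (a sketch, and an expensive one since every operation becomes a software-float operation): compute in extended precision and round once at the end, e.g. using BigFloat as a reference:

```julia
using Statistics

M = fill(5.421241248937501521, (1, 10000))

# Compute the variance in BigFloat (256-bit by default), then round the
# final result once to Float64.
exact_var = Float64(var(big.(M)))  # 0.0 for a constant array
```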


To elaborate, the correction term mean(M .- μ₁) works because it’s summing much smaller values (ideally 0 variance after all) than mean(M) did and can thus represent the small error. The correction is less helpful with larger variances:

julia> using Statistics, Random

julia> function testmean(intended, offset, halfn::Integer)
         M = shuffle!(vcat(fill(intended-offset, halfn),
                           fill(intended+offset, halfn)))
         mu = mean(M)
         correction = mean(M .- mu)
         mu, correction, mu+correction
       end
testmean (generic function with 1 method)

julia> testmean(5.3, 0, 5000)
(5.300000000000007, -7.105427357601002e-15, 5.3)

julia> testmean(5.3, 50, 5000) # can overcorrect
(5.3000000000000025, -3.819877747446298e-15, 5.299999999999999)

julia> testmean(5.3, 100, 5000) # doesn't correct at all
(5.299999999999995, 0.0, 5.299999999999995)