Roundoff error in variance of array of constant value

In the code below, var(M) is not 0. I guess the reason might be floating-point roundoff error, but I wonder if anyone knows how to deal with this? Thanks in advance.

M = fill(5.421241248937501521, (1,10000))
M = Float64.(M)
var(M)
var(M[:,100:10000])

It’s just floating-point roundoff error.

julia> var(M)
3.865805016084566e-29

The limited precision of Float64 is even relevant just for parsing the literal:

julia> 5.421241248937501521
5.421241248937502

So it’s no surprise that computing the mean by scaling a large sum of values, and then the variance by scaling a large sum of squared deviations from that mean, accumulates roundoff. In this case, you can reduce the number crunching by informing var that the mean is exactly the one value filling the entire matrix, so it’ll skip the sum for that part:

julia> var(M; mean=M[begin])
0.0

Now that all the deviations from the mean are exactly 0.0, they square, sum, and scale down to exactly 0.0 as well. For cases where you don’t have just one repeated value, you have to work with the limited precision in mind and round to an accepted tolerance; it’s the same reason why exact comparisons like == 0.0 are far rarer than approximate ones.
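For example, an approximate comparison with a tolerance tied to the element spacing (the choice of eps(first(M)) as the tolerance is just one reasonable convention, not a universal rule) behaves as expected here:

```julia
using Statistics

M = fill(5.421241248937501521, (1, 10000))

# An exact comparison to 0.0 is fragile, but an approximate one with a
# tolerance on the scale of the Float64 spacing near the data succeeds:
isapprox(var(M), 0.0; atol = eps(first(M)))  # true
```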

It is possible for a type to represent an array filled with 1 particular value far more efficiently, and for var to specially dispatch on said type to skip all the number crunching and return the matching type of zero. FillArrays.jl implements this:

julia> using FillArrays, Statistics

julia> M2 = Fill(5.421241248937501521, 1, 10000)
1×10000 Fill{Float64}, with entries equal to 5.421241248937502

julia> fieldnames(typeof(M2))
(:value, :axes)

julia> M2.value # only stores 1 value, not 10000 copies
5.421241248937502

julia> var(M2)
0.0

julia> @which var(M2)
var(A::FillArrays.AbstractFill{T}; corrected, mean, dims) where T<:Number
     @ FillArraysStatisticsExt C:\...\.julia\packages\FillArrays\...\ext\FillArraysStatisticsExt.jl:17
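To make the dispatch idea concrete, here is a minimal sketch of the same trick on a hand-rolled type. This is not FillArrays’ actual code; ConstFill is a hypothetical type introduced purely for illustration:

```julia
using Statistics

# A hypothetical array type that stores one value and a length, instead of
# n copies of the value.
struct ConstFill{T} <: AbstractVector{T}
    value::T
    len::Int
end
Base.size(A::ConstFill) = (A.len,)
Base.getindex(A::ConstFill, i::Int) = A.value

# Every element equals the mean, so the variance is exactly zero — no
# elementwise pass is needed at all.
Statistics.var(A::ConstFill{T}; corrected::Bool = true) where {T<:Number} =
    zero(float(T))

v = ConstFill(5.421241248937501521, 10_000)
var(v)  # 0.0, computed without touching 10000 elements
```

The point is only that dispatching on the array type lets var return the right zero in O(1); FillArrays’ real method also handles the corrected, mean, and dims keywords.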

First, here are the results in the REPL.

julia> M = fill(5.421241248937501521, (1,10000))
1×10000 Matrix{Float64}:
 5.42124  5.42124  5.42124  …  5.42124  5.42124

julia> M = Float64.(M)
1×10000 Matrix{Float64}:
 5.42124  5.42124  5.42124  …  5.42124  5.42124

julia> var(M)
1.333308260649575e-28

julia> var(M[:,100:10000])
1.5463235527558329e-28

The main issue here is the calculation of the mean.

julia> mean(M)
5.42124124893749

julia> M[1] - mean(M)
1.1546319456101628e-14

julia> eps(M[1])
8.881784197001252e-16
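To put those two numbers in perspective, the mean’s error can be measured in multiples of eps(M[1]), i.e. in units of the spacing between adjacent Float64 values near M[1]. The exact count depends on the summation order, but it is a small multiple:

```julia
using Statistics

M = fill(5.421241248937501521, (1, 10000))

# Error of the computed mean, in units of the Float64 spacing near M[1]:
err_ulps = abs(M[1] - mean(M)) / eps(M[1])  # a small multiple of eps
```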

Fortunately, if we can provide an accurate mean, then we can calculate accurate variance.

julia> var(M; mean=M[1])
0.0

This suggests an extended approach for this situation:

julia> ou_var(X) = try
           _ou_mean = only(unique(X))
           var(X; mean=_ou_mean)
       catch _
           var(X)
       end
ou_var (generic function with 1 method)

julia> ou_var(M)
0.0
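If you’d rather avoid the try/catch, the same idea can be sketched with Base.allequal (available since Julia 1.8); ae_var is just an illustrative name, not an established function:

```julia
using Statistics

# Sketch: check for a constant array up front instead of catching the
# error that only(unique(X)) throws for non-constant input.
ae_var(X) = allequal(X) ? var(X; mean = first(X)) : var(X)

ae_var(fill(5.421241248937501521, 10_000))  # 0.0
ae_var([1.0, 2.0, 3.0])                     # falls back to plain var: 1.0
```

Note that allequal still costs a pass over the data, so this mainly trades the exception-handling overhead for an explicit check.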

As already noted you have a floating-point roundoff error already in the mean computation. If you’re willing to spend more calculation time you can improve on this with a correction term:

julia> M[1]
5.421241248937502

julia> μ₁ = mean(M)
5.421241248937496

julia> μ₂ = μ₁ + mean(M .- μ₁)
5.421241248937502

and you can integrate this correction into the variance computation simply by

julia> var(M .- mean(M))
0.0
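Packaged into a helper (a sketch; corrected_var is an illustrative name), the correction pass looks like this — a first pass gets an approximate mean, a second pass averages the small residuals to refine it:

```julia
using Statistics

# Sketch: two-pass variance with a refined mean. The residuals x - μ₁ are
# tiny, so their sum can represent the error that mean(X) rounded away.
function corrected_var(X; corrected::Bool = true)
    μ₁ = mean(X)
    μ₂ = μ₁ + mean(x - μ₁ for x in X)   # correction term from the residuals
    var(X; mean = μ₂, corrected)
end

corrected_var(fill(5.421241248937501521, 10_000))  # 0.0 in this case
```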

Calculating variance accurately can take a bit of care. In particular, @GunnarFarneback’s recommendation looks pretty good (EDIT: although the builtin algorithm is already quite reliable, so unless this nonzero-for-constant-inputs result is a problem, I would just use var as-is). If you want a cheaper but marginally less accurate calculation, var(M .- first(M)) (or offsetting by any other convex combination of elements of M) is also decent (not only in this case).

It appears that Statistics uses the Welford algorithm when the mean is not provided and the target is a general iterator. For AbstractArrays, however, it uses a two-pass algorithm. Obviously, the Welford algorithm is exact in this special case:

julia> var(M)
3.865805016084566e-29

julia> var(Iterators.take(M, length(M)))
0.0
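For reference, here is a minimal sketch of the one-pass Welford update (not the stdlib’s exact code). Because it subtracts a running mean, a constant input produces exact zeros at every step:

```julia
# Minimal sketch of Welford's one-pass algorithm (sample variance).
function welford_var(itr)
    n = 0
    μ = 0.0
    m2 = 0.0                # running sum of squared deviations
    for x in itr
        n += 1
        δ = x - μ
        μ += δ / n          # update the running mean
        m2 += δ * (x - μ)   # note: uses the updated mean
    end
    m2 / (n - 1)
end

welford_var(fill(5.421241248937501521, 10_000))  # exactly 0.0
```

On a constant input, the first iteration sets μ to the value exactly, so every later δ is exactly 0.0 and m2 never grows.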

I can’t say which should be preferred in general. Clearly, someone made the deliberate choice to use a two-pass algorithm for AbstractArrays. Given the two-pass algorithm’s cost advantage, I must imagine it is slightly preferable in most situations.

Note that the built-in algorithms take this care, and are numerically stable.

In the example above, the backwards error is extremely small already. It’s not clear to me in what context this error is a practical problem, or if it’s just one of those cases where people are surprised by the existence of roundoff error.

There are ways to get exactly rounded results, of course, but they are only worth it in rare circumstances.
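One such way (a sketch, and an expensive one since every operation becomes a software-float operation): compute in extended precision and round once at the end, e.g. using BigFloat as a reference:

```julia
using Statistics

M = fill(5.421241248937501521, (1, 10000))

# Compute the variance in BigFloat (256-bit by default), then round the
# final result once to Float64.
exact_var = Float64(var(big.(M)))  # 0.0 for a constant array
```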


To elaborate, the correction term mean(M .- μ₁) works because it’s summing much smaller values (ideally 0 variance after all) than mean(M) did and can thus represent the small error. The correction is less helpful with larger variances:

julia> using Statistics, Random

julia> function testmean(intended, offset, halfn::Integer)
         M = shuffle!(vcat(fill(intended-offset, halfn),
                           fill(intended+offset, halfn)))
         mu = mean(M)
         correction = mean(M .- mu)
         mu, correction, mu+correction
       end
testmean (generic function with 1 method)

julia> testmean(5.3, 0, 5000)
(5.300000000000007, -7.105427357601002e-15, 5.3)

julia> testmean(5.3, 50, 5000) # can overcorrect
(5.3000000000000025, -3.819877747446298e-15, 5.299999999999999)

julia> testmean(5.3, 100, 5000) # doesn't correct at all
(5.299999999999995, 0.0, 5.299999999999995)