# Roundoff error in variance of array of constant value

In the code below, var(M) is not 0, I guess the reason might be computation error, but I wonder if anyone knows how to deal with this? Thanks in advance.

``````M = fill(5.421241248937501521, (1,10000))
M = Float64.(M)
var(M)
var(M[:,100:10000])
``````

It’s just floating-point roundoff error.

``````julia> var(M)
3.865805016084566e-29
``````
3 Likes

The limited precision of `Float64` is even relevant just for parsing the literal:

``````julia> 5.421241248937501521
5.421241248937502
``````

So it’s no surprise that scaling a large sum of values to compute the mean then scaling a large sum of squares of deviations from the mean to compute the variance will deviate. In this case, you can reduce the number crunching by informing `var` that the mean is exactly the one value filling the entire matrix, so it’ll skip the sum for that part:

``````julia> var(M; mean=M[begin])
0.0
``````

Now that all the deviations from the mean are exactly `0.0`, those will square, sum, and scale down to exactly `0.0` as well. For cases where you don’t just have the one particular value, you just have to work with the limited precision in mind and round to an accepted precision; it’s the same reason why exact comparisons `== 0.0` are far rarer than approximate ones.

It is possible for a type to far more efficiently represent an array filled with 1 particular value and for `var` to specially dispatch on said type to skip all the number crunching to getting the matching type of zero. FillArrays.jl implements this:

``````julia> using FillArrays, Statistics

julia> M2 = Fill(5.421241248937501521, 1, 10000)
1×10000 Fill{Float64}, with entries equal to 5.421241248937502

julia> fieldnames(typeof(M2))
(:value, :axes)

julia> M2.value # only stores 1 value, not 10000 copies
5.421241248937502

julia> var(M2)
0.0

julia> @which var(M2)
var(A::FillArrays.AbstractFill{T}; corrected, mean, dims) where T<:Number
@ FillArraysStatisticsExt C:\...\.julia\packages\FillArrays\...\ext\FillArraysStatisticsExt.jl:17
``````
3 Likes

First, here are the results in the REPL.

``````julia> M = fill(5.421241248937501521, (1,10000))
1×10000 Matrix{Float64}:
5.42124  5.42124  5.42124  …  5.42124  5.42124

julia> M = Float64.(M)
1×10000 Matrix{Float64}:
5.42124  5.42124  5.42124  …  5.42124  5.42124

julia> var(M)
1.333308260649575e-28

julia> var(M[:,100:10000])
1.5463235527558329e-28
``````

The main issue here is the calculation of the mean.

``````julia> mean(M)
5.42124124893749

julia> M[1] - mean(M)
1.1546319456101628e-14

julia> eps(M[1])
8.881784197001252e-16
``````

Fortunately, if we can provide an accurate mean, then we can calculate accurate variance.

``````julia> var(M; mean=M[1])
0.0
``````

This suggests an extended approach for this situation:

``````julia> ou_var(X) = try
_ou_mean = only(unique(X))
var(X; mean=_ou_mean)
catch _
var(X)
end
ou_var (generic function with 1 method)

julia> ou_var(M)
0.0
``````
2 Likes

As already noted you have a floating-point roundoff error already in the mean computation. If you’re willing to spend more calculation time you can improve on this with a correction term:

``````julia> M[1]
5.421241248937502

julia> μ₁ = mean(M)
5.421241248937496

julia> μ₂ = μ₁ + mean(M .- μ₁)
5.421241248937502
``````

and you can integrate this correction into the variance computation simply by

``````julia> var(M .- mean(M))
0.0
``````
2 Likes

Calculating variance accurately can take a bit of care. In particular, `@GunnarFarneback`’s reccomendation looks pretty good (EDIT: although the builtin algorithm is already quite reliable, so unless this nonzero-for-constant-inputs is a problem, I would just use `var` as-is). If you want a cheaper but marginally less accurate calculation, `var(M .- first(M))` (or offset by any other convex combination of elements of `M`) is also decent (not only in this case).

It appears that `Statistics` uses the Welford algorithm when the mean is not provided and the target is an general iterator. For `AbstractArrays`, however, it uses a two-pass algorithm. Obviously, the Welford algorithm is exact in this special case:

``````julia> var(M)
3.865805016084566e-29

julia> var(Iterators.take(M, length(M)))
0.0
``````

I can’t say which should be preferred in general. Clearly, someone made the deliberate choice to use a two-pass algorithm for `AbstractArray`. Given the two-pass’s cost, I must imagine it is slightly preferable in most situations.

Note that the built-in algorithms take this care, and are numerically stable.

In the example above, the backwards error is extremely small already. It’s not clear to me in what context this error is a practical problem, or if it’s just one of those cases where people are surprised by the existence of a round off error.

There are ways to get exactly rounded results, of course, but they are only worth it in rare circumstances.

6 Likes

To elaborate, the correction term `mean(M .- μ₁)` works because it’s summing much smaller values (ideally 0 variance after all) than `mean(M)` did and can thus represent the small error. The correction is less helpful with larger variances:

``````julia> using Statistics, Random

julia> function testmean(intended, offset, halfn::Integer)
M = shuffle!(vcat(fill(intended-offset, halfn),
fill(intended+offset, halfn)))
mu = mean(M)
correction = mean(M .- mu)
mu, correction, mu+correction
end
testmean (generic function with 1 method)

julia> testmean(5.3, 0, 5000)
(5.300000000000007, -7.105427357601002e-15, 5.3)

julia> testmean(5.3, 50, 5000) # can overcorrect
(5.3000000000000025, -3.819877747446298e-15, 5.299999999999999)

julia> testmean(5.3, 100, 5000) # doesn't correct at all
(5.299999999999995, 0.0, 5.299999999999995)
``````