In the code below, var(M) is not 0. I guess the reason is computation error, but does anyone know how to deal with this? Thanks in advance.
M = fill(5.421241248937501521, (1,10000))
M = Float64.(M)
var(M)
var(M[:,100:10000])
It’s just floating-point roundoff error.
julia> var(M)
3.865805016084566e-29
The limited precision of Float64
is even relevant just for parsing the literal:
julia> 5.421241248937501521
5.421241248937502
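As a quick sketch of that point, you can parse the same literal at BigFloat precision (via the big"..." string macro) and see that the Float64 literal is just the correctly rounded approximation of the decimal string:

```julia
# Parse the literal at Float64 and at 256-bit BigFloat precision to see
# what Float64 has to discard.
x64  = 5.421241248937501521        # rounds to the nearest Float64
xbig = big"5.421241248937501521"   # BigFloat keeps more of the digits

xbig == x64          # false: the decimal is not exactly representable in Float64
Float64(xbig) == x64 # true: rounding the BigFloat down recovers the same Float64
```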
So it’s no surprise that computing the mean from a large sum of values, then the variance from a large sum of squared deviations from that mean, accumulates some roundoff. In this case, you can reduce the number crunching by informing var
that the mean is exactly the one value filling the entire matrix, so it skips the sum for that part:
julia> var(M; mean=M[begin])
0.0
Now that all the deviations from the mean are exactly 0.0, they square, sum, and scale down to exactly 0.0 as well. For cases where you don’t just have one particular value, you have to work with the limited precision in mind and round to an accepted tolerance; it’s the same reason exact comparisons like == 0.0 are far rarer than approximate ones.
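A small sketch of that last point: rather than testing a computed variance with ==, compare against zero with an explicit absolute tolerance (the 1e-12 here is just an illustrative choice):

```julia
using Statistics

M = fill(5.421241248937501521, (1, 10000))
v = var(M)

v == 0.0                       # may well be false due to roundoff
isapprox(v, 0.0; atol=1e-12)   # true: zero up to the chosen tolerance
```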
It is possible for a type to represent an array filled with one particular value far more efficiently, and for var
to dispatch specially on that type, skipping all the number crunching and returning the matching type of zero. FillArrays.jl implements this:
julia> using FillArrays, Statistics
julia> M2 = Fill(5.421241248937501521, 1, 10000)
1×10000 Fill{Float64}, with entries equal to 5.421241248937502
julia> fieldnames(typeof(M2))
(:value, :axes)
julia> M2.value # only stores 1 value, not 10000 copies
5.421241248937502
julia> var(M2)
0.0
julia> @which var(M2)
var(A::FillArrays.AbstractFill{T}; corrected, mean, dims) where T<:Number
@ FillArraysStatisticsExt C:\...\.julia\packages\FillArrays\...\ext\FillArraysStatisticsExt.jl:17
First, here are the results in the REPL.
julia> M = fill(5.421241248937501521, (1,10000))
1×10000 Matrix{Float64}:
5.42124 5.42124 5.42124 … 5.42124 5.42124
julia> M = Float64.(M)
1×10000 Matrix{Float64}:
5.42124 5.42124 5.42124 … 5.42124 5.42124
julia> var(M)
1.333308260649575e-28
julia> var(M[:,100:10000])
1.5463235527558329e-28
The main issue here is the calculation of the mean.
julia> mean(M)
5.42124124893749
julia> M[1] - mean(M)
1.1546319456101628e-14
julia> eps(M[1])
8.881784197001252e-16
Fortunately, if we can provide an accurate mean, then we can calculate the variance accurately.
julia> var(M; mean=M[1])
0.0
This suggests an extended approach for this situation:
julia> ou_var(X) = try
           # if X contains exactly one distinct value, use it as the exact mean
           _ou_mean = only(unique(X))
           var(X; mean=_ou_mean)
       catch _
           # otherwise fall back to the ordinary computation
           var(X)
       end
ou_var (generic function with 1 method)
julia> ou_var(M)
0.0
As already noted you have a floating-point roundoff error already in the mean computation. If you’re willing to spend more calculation time you can improve on this with a correction term:
julia> M[1]
5.421241248937502
julia> μ₁ = mean(M)
5.421241248937496
julia> μ₂ = μ₁ + mean(M .- μ₁)
5.421241248937502
and you can integrate this correction into the variance computation simply by
julia> var(M .- mean(M))
0.0
Calculating variance accurately can take a bit of care. In particular, @GunnarFarneback
’s recommendation looks pretty good (EDIT: although the builtin algorithm is already quite reliable, so unless this nonzero-for-constant-inputs behavior is a problem, I would just use var
as-is). If you want a cheaper but marginally less accurate calculation, var(M .- first(M))
(or offsetting by any other convex combination of elements of M
) is also decent (not only in this case).
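A quick sketch of that shift trick; since variance is invariant under subtracting a constant, offsetting by one element only changes the floating-point behavior, not the true result:

```julia
using Statistics

M = fill(5.421241248937501521, (1, 10000))

# Subtracting first(M) makes all the summands small (here exactly zero),
# so the mean and the squared deviations lose far less precision.
var(M .- first(M))   # exactly 0.0 for this constant input
```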
It appears that Statistics
uses the Welford algorithm when the mean is not provided and the target is a general iterator. For AbstractArrays
, however, it uses a two-pass algorithm. Obviously, the Welford algorithm is exact in this special case:
julia> var(M)
3.865805016084566e-29
julia> var(Iterators.take(M, length(M)))
0.0
I can’t say which should be preferred in general. Clearly, someone made the deliberate choice to use a two-pass algorithm for AbstractArray
. Given the two-pass algorithm’s extra cost, I must imagine its accuracy is slightly preferable in most situations.
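For reference, a minimal sketch of Welford’s one-pass algorithm (the name welford_var is just illustrative, not anything exported by Statistics):

```julia
# Welford's online algorithm: a single pass with numerically stable
# incremental updates of the mean and the sum of squared deviations.
function welford_var(xs; corrected::Bool=true)
    n = 0
    μ = 0.0
    m2 = 0.0                  # running Σ (xᵢ - mean)²
    for x in xs
        n += 1
        δ = x - μ
        μ += δ / n
        m2 += δ * (x - μ)     # note: uses the *updated* mean
    end
    return m2 / (n - (corrected ? 1 : 0))
end
```

For a constant input the very first update sets μ to the value exactly, so every later δ is exactly zero and the result is an exact 0.0.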
Note that the built-in algorithms take this care, and are numerically stable.
In the example above, the backwards error is extremely small already. It’s not clear to me in what context this error is a practical problem, or if it’s just one of those cases where people are surprised by the existence of a round off error.
There are ways to get exactly rounded results, of course, but they are only worth it in rare circumstances.
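One such sketch: every Float64 is itself an exact rational number, so lifting the data to Rational{BigInt} before computing makes the mean and variance exact rational arithmetic (at real cost in speed and memory):

```julia
using Statistics

M = fill(5.421241248937501521, (1, 10000))

# The Float64 -> Rational{BigInt} conversion is lossless, and all
# subsequent arithmetic inside mean/var is exact.
Mr = Rational{BigInt}.(M)
var(Mr)   # exactly 0//1 for a constant input
```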
To elaborate, the correction term mean(M .- μ₁)
works because it sums much smaller values (ideally with 0 variance, after all) than mean(M)
did, and can thus represent the small error. The correction is less helpful with larger variances:
julia> using Statistics, Random
julia> function testmean(intended, offset, halfn::Integer)
           M = shuffle!(vcat(fill(intended-offset, halfn),
                             fill(intended+offset, halfn)))
           mu = mean(M)
           correction = mean(M .- mu)
           mu, correction, mu+correction
       end
testmean (generic function with 1 method)
julia> testmean(5.3, 0, 5000)
(5.300000000000007, -7.105427357601002e-15, 5.3)
julia> testmean(5.3, 50, 5000) # can overcorrect
(5.3000000000000025, -3.819877747446298e-15, 5.299999999999999)
julia> testmean(5.3, 100, 5000) # doesn't correct at all
(5.299999999999995, 0.0, 5.299999999999995)