Mean of integers overflows - bug or expected behaviour?

In floating-point computation it is usually accepted to prioritize correctness over performance and simplicity of design. For example, Kahan’s summation formula increases the operation count, and gradual underflow complicates the design of floating-point hardware, both in order to ensure the correctness of error analysis.
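For reference, Kahan’s compensated summation can be sketched in a few lines (a textbook version, not the code used in any particular library):

```julia
# Textbook Kahan (compensated) summation: a running compensation
# term `c` captures the low-order bits lost in each addition.
function kahan_sum(xs)
    s = zero(eltype(xs))
    c = zero(eltype(xs))   # compensation for lost low-order bits
    for x in xs
        y = x - c          # apply the correction from the previous step
        t = s + y          # the high-order bits of y are absorbed into s
        c = (t - s) - y    # recover the low-order bits that were lost
        s = t
    end
    return s
end
```

The extra subtraction per element is exactly the increased operation count mentioned above.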

We don’t use Kahan summation by default; we use pairwise summation, which is nearly as accurate as Kahan summation but almost as fast as naive summation.
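Pairwise summation recursively splits the array in half and sums each half, which reduces worst-case error growth from O(n) to O(log n). A minimal sketch (real implementations, including Base’s `sum`, use a larger base-case block for speed):

```julia
# Minimal recursive pairwise summation for a non-empty array.
# Production code switches to a plain loop below some block size.
function pairwise_sum(xs, lo=firstindex(xs), hi=lastindex(xs))
    if hi - lo < 8                      # small base case: plain loop
        s = xs[lo]
        for i in lo+1:hi
            s += xs[i]
        end
        return s
    end
    mid = (lo + hi) >>> 1               # split the range in half
    return pairwise_sum(xs, lo, mid) + pairwise_sum(xs, mid + 1, hi)
end
```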

In general, there is a tradeoff — numerical libraries don’t always use the fastest or the most accurate possible algorithm. However, one generally defaults to an algorithm that is numerically stable and minimizes the chance of spurious overflow while still giving reasonable performance.

I think the comparison with norm is the most pertinent here.

3 Likes

That’s not generally true; it’s a balance. For example, summing floating-point numbers does not use an exact algorithm even though one exists, because it’s way too slow. Implementations of transcendental functions are generally accurate to within 1 ulp even though it’s possible to be fully correctly rounded, again because it’s too slow to be fully accurate. And it’s a matter of perspective whether this is a floating-point computation or an integer one.

1 Like

If it always gives a floating-point result, it must be treated as a floating-point computation. Otherwise you guarantee yourself a steady stream of quite unhappy numerics people complaining about this behaviour.

If it’s a floating point computation, why are the inputs integers? It’s just not as cut and dried as you’re making it out to be.

I don’t disagree that it would be good to be more accurate if it can be done without a large performance hit, but so far no one has come up with a way to do that.

1 Like

Computing the mean of integers with floats also produces spurious results:

julia> x = [typemax(Int), -3, -typemax(Int)]
3-element Array{Int64,1}:
  9223372036854775807
                   -3
 -9223372036854775807

julia> sum(float, x)/length(x)
0.0

julia> sum(x)/length(x)
-1.0

so now for this example I prefer the “integer sum” behaviour. What do you think, @Mikhail_Kagalenko? :slight_smile:

julia> x = [typemax(Int), 1023, -typemax(Int)]
3-element Array{Int64,1}:
  9223372036854775807
                 1023
 -9223372036854775807

julia> sum(float, x)/length(x)
0.0

julia> sum(x)/length(x)
341.0

julia> x = [typemax(Int), 1025, -typemax(Int)]
3-element Array{Int64,1}:
  9223372036854775807
                 1025
 -9223372036854775807

julia> sum(float, x)/length(x)
682.6666666666666

julia> sum(x)/length(x)
341.6666666666667

:wink:

@stevengj’s proposed fix also gives -1 on your example

ditto for this one (gives 341.3333333333333)

ok, then I must be looking at a different function:

julia> x = [typemax(Int), -3, -typemax(Int)]
3-element Array{Int64,1}:
  9223372036854775807
                   -3
 -9223372036854775807

julia> function _mean(A::AbstractArray, ::Colon)
           isempty(A) && return sum(A) / length(A) # or maybe: throw(ArgumentError("mean requires non-empty array"))
           n = length(A)
           x1 = first(A) / n
           return sum(x -> first(promote(x,x1)), A) / n
       end
_mean (generic function with 1 method)

julia> _mean(x, :)
0.0

I will look into why my implementation gives better results, but in the meantime, your examples are just more versions of the earlier objection, which has been answered.

EDIT: Ah yes, apologies, it does give zeros. And yes, this sort of thing is expected in floating-point computations, and I personally much prefer it to integer overflow.

I see, then I misunderstood the reason for the discussion. I thought the issue was that calculating the mean with an integer sum was overflowing, hence “less correct” than with floats. This was meant as an example of exactly the opposite.

This was meant as an example of exactly the opposite.

Your example is the mean() computation obeying the accuracy bound obtainable from error analysis, as explained in the earlier answer that I linked.

I did;

Since we’re discussing which overflows we prefer, I have nothing more to add :wink:

The error in mean(float, x) is not due to overflow.

I have nothing more to add

What, you run out of emoticons?

The arguments to mean are usually measurements of some kind. To me, it seems more correct to have floating-point rounding error than integer overflow.
To my understanding, the rounding error of floating-point addition is more severe when there are both positive and negative values of large absolute value, or when the values differ greatly in magnitude. In those cases the use of mean should already be handled with care, even if it is computed 100% accurately, as using the mean alone can be quite misleading.
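To illustrate the first case, two values of similar magnitude and opposite sign cancel almost all significant digits, so the computed mean is dominated by rounding error (the numbers here are hypothetical, chosen only to show the effect):

```julia
# Catastrophic cancellation: the true mean of these three values is 1/3,
# but 1e16 + 1.0 rounds back to 1e16 in Float64, so the large terms
# cancel exactly and the contribution of 1.0 is lost.
x = [1e16, 1.0, -1e16]
sum(x) / length(x)   # gives 0.0, not ≈0.333
```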

1 Like

What you are proposing is of course not difficult to understand for this specific case, it is just unclear how this would generalize.

Lots of things have a mean, eg

julia> mean([[1, 2], [3, 4]])
2-element Array{Float64,1}:
 2.0
 3.0

Should these also convert the elements to float?

I would like to understand what exactly is being proposed:

  1. special-casing <:AbstractVector{Int},
  2. a generic promotion framework (how would it work?)
  3. something else?

All the solutions I can imagine violate some invariant that some users will consider “surprising” (I am happy to give examples if someone proposes something concrete). Maybe I am not seeing an obvious and clean solution though — I would like to hear about it.

1 Like

Yes, and my proposed implementation already does (in a completely generic way). It’s not complicated: The basic principle is to do the accumulation using the output type.

6 Likes

Sorry I missed your suggestion in that issue, thanks for pointing it out.

It is indeed a neat implementation, but I am not sure it always does the right thing, especially with Any, eg

julia> using Statistics

julia> v = Any[1, 3//1, 0]
3-element Array{Any,1}:
  1
 3//1
  0

julia> _mean(v, Colon())
1.3333333333333333

julia> mean(v)
4//3

Who’s to say this is “the” right thing?