In floating-point computation it is generally accepted that correctness takes priority over performance and simplicity of design. For example, Kahan’s summation formula increases the operation count, while gradual underflow complicates floating-point hardware in order to keep error analysis valid.
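For reference, here is a minimal sketch of Kahan’s compensated summation (an illustrative implementation; the name kahan_sum is made up and this is not Julia’s Base code):

function kahan_sum(x::AbstractVector{<:AbstractFloat})
    s = zero(eltype(x))    # running sum
    c = zero(eltype(x))    # compensation: low-order bits lost so far
    for xi in x
        y = xi - c         # correct the next summand by the previously lost bits
        t = s + y          # add; the low-order bits of y may be lost here
        c = (t - s) - y    # recover what was actually lost in that addition
        s = t
    end
    return s
end

The compensation roughly quadruples the floating-point operations per element compared with a naive loop, which is the operation-count increase referred to above.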
We don’t use Kahan summation by default; we use pairwise summation, which is almost as accurate as Kahan summation but nearly as fast as naive summation.
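For comparison, a minimal sketch of pairwise summation (again illustrative: the name pairwise_sum and the cutoff of 8 are made up, and Base Julia’s sum uses a much larger base-case block plus other optimizations):

function pairwise_sum(x::AbstractVector{<:AbstractFloat}, lo = firstindex(x), hi = lastindex(x))
    n = hi - lo + 1
    if n <= 8                      # small base case: sum naively
        s = zero(eltype(x))
        for i in lo:hi
            s += x[i]
        end
        return s
    end
    mid = lo + n ÷ 2               # split in half and recurse on each part
    return pairwise_sum(x, lo, mid - 1) + pairwise_sum(x, mid, hi)
end

The recursion keeps the worst-case error growth at O(log n) rather than the O(n) of naive left-to-right summation, while the base case keeps the per-element cost essentially that of a naive loop.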
In general, there is a tradeoff — numerical libraries don’t always use the fastest or the most accurate possible algorithm. However, one generally defaults to an algorithm that is numerically stable and minimizes the chance of spurious overflow while still giving reasonable performance.
I think the comparison with norm is the most pertinent here.
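(To make that analogy concrete, here is a hypothetical sketch of why norm is comparable: a naive 2-norm can overflow spuriously for representable inputs, so library implementations rescale, accepting some extra work. The function names below are made up for illustration.)

naive_norm(x) = sqrt(sum(abs2, x))                 # overflows for large entries, e.g. [1e300, 1e300]

function scaled_norm(x)
    m = maximum(abs, x)                            # rescale by the largest magnitude
    iszero(m) && return float(m)                   # all-zero input
    return m * sqrt(sum(t -> abs2(t / m), x))      # each t/m is at most 1 in magnitude
end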
That’s not generally true; it’s a balance. For example, sum
of floating-point numbers does not use an exact algorithm, even though exact algorithms exist, because they are far too slow. Implementations of transcendental functions are generally accurate to within 1 ulp even though fully correctly rounded results are possible, again because that would be too slow. And it’s a matter of perspective whether this is a float computation or an integer one.
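(As a small illustration, not from the thread: even a single addition of floats with very different magnitudes is already inexact, which is why any fixed-precision summation loop can lose low-order bits.)

julia> (1e16 + 1.0) - 1e16   # the 1.0 is lost when added to 1e16
0.0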
If it always gives a floating-point result, it must be treated as a floating-point computation. Otherwise you guarantee yourself a steady stream of quite unhappy numerics people complaining about this behaviour.
If it’s a floating point computation, why are the inputs integers? It’s just not as cut and dried as you’re making it out to be.
I don’t disagree that it would be good to be more accurate if it can be done without a large performance hit, but so far no one has come up with a way to do that.
Computing the mean of integers with floats also produces spurious results:
julia> x = [typemax(Int), -3, -typemax(Int)]
3-element Array{Int64,1}:
9223372036854775807
-3
-9223372036854775807
julia> sum(float, x)/length(x)
0.0
julia> sum(x)/length(x)
-1.0
So now, for this example, I prefer the “integer sum” behaviour; what do you think, @Mikhail_Kagalenko?
julia> x = [typemax(Int), 1023, -typemax(Int)]
3-element Array{Int64,1}:
9223372036854775807
1023
-9223372036854775807
julia> sum(float, x)/length(x)
0.0
julia> sum(x)/length(x)
341.0
julia> x = [typemax(Int), 1025, -typemax(Int)]
3-element Array{Int64,1}:
9223372036854775807
1025
-9223372036854775807
julia> sum(float, x)/length(x)
682.6666666666666
julia> sum(x)/length(x)
341.6666666666667
@stevengj’s proposed fix also gives -1 on your example
ditto for this one (gives 341.3333333333333)
ok, then I must be looking at a different function:
julia> x = [typemax(Int), -3, -typemax(Int)]
3-element Array{Int64,1}:
9223372036854775807
-3
-9223372036854775807
julia> function _mean(A::AbstractArray, ::Colon)
isempty(A) && return sum(A) / length(A) # or maybe: throw(ArgumentError("mean requires non-empty array"))
n = length(A)
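# x1 = first(A) / n is computed in the output type (Float64 for Int input);
# promoting each element to that type before summing means the accumulation
# itself happens in the output type, so the integer sum cannot overflow.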
x1 = first(A) / n
return sum(x -> first(promote(x,x1)), A) / n
end
_mean (generic function with 1 method)
julia> _mean(x, :)
0.0
I will look into why my implementation gives better results, but in the meantime, your examples are just more versions of the earlier objection, which has already been answered.
EDIT: Ah yes, apologies, it does give zeros. And yes, this sort of thing is expected in floating-point computations, and I personally much prefer it to integer overflow.
I see, then I misunderstood the reason for the discussion. I thought the issue was that calculating the mean with an integer sum was overflowing, hence “less correct” than with floats. This was meant as an example of exactly the opposite.
This was meant as an example of exactly the opposite.
Your example is a mean() computation that obeys the accuracy bound obtainable from error analysis, as explained in the earlier answer that I linked.
I did;
Since we’re discussing which overflows we prefer, I have nothing more to add
The error in mean(float, x) is not due to overflow.
I have nothing more to add
What, you run out of emoticons?
The arguments to mean are usually some measurements. For me, it seems more correct to have float rounding error than integer overflow.
To my understanding, the float rounding error for addition is more severe if there are both positive and negative measurements with large absolute values, or if the measurements have very different magnitudes. In these cases, the use of mean should already be handled with care, even if it is 100% accurate, as it can be quite misleading to use mean alone.
What you are proposing is of course not difficult to understand for this specific case; it is just unclear how it would generalize.
Lots of things have a mean, e.g.
julia> mean([[1, 2], [3, 4]])
2-element Array{Float64,1}:
2.0
3.0
Should these also convert the elements to float?
I would like to understand what exactly is being proposed:
- special-casing <:AbstractVector{Int},
- a generic promotion framework (how would it work?),
- something else?
All the solutions I can imagine violate some invariant that some users will consider “surprising” (I am happy to give examples if someone proposes something concrete). Maybe I am not seeing an obvious and clean solution though — I would like to hear about it.
Yes, and my proposed implementation already does so (in a completely generic way). It’s not complicated: the basic principle is to do the accumulation using the output type.
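(A hypothetical restatement of that principle, not the code from the linked issue: take the output type to be the type of one element divided by the length, and convert each element to that type before accumulating.)

function mean_in_output_type(A::AbstractArray)
    n = length(A)
    T = typeof(first(A) / n)                # output type of the mean, e.g. Float64 for Int elements
    return sum(x -> convert(T, x), A) / n   # accumulate in T, so an integer sum cannot overflow
end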
Sorry I missed your suggestion in that issue, thanks for pointing it out.
It is indeed a neat implementation, but I am not sure it always does the right thing, especially with Any, e.g.
julia> using Statistics
julia> v = Any[1, 3//1, 0]
3-element Array{Any,1}:
1
3//1
0
julia> _mean(v, Colon())
1.3333333333333333
julia> mean(v)
4//3
Who’s to say this is “the” right thing?