Mean of integers overflows - bug or expected behaviour?

Mikhail_Kagalenko · February 17, 2020, 4:54pm

The function mean(), when used on Int inputs, returns a float result.
However, it may silently overflow on integer arithmetic.

julia> mean([1,1]*typemax(Int))
-1

Maintainer of Statistics.jl closed
the issue, saying that this is an expected behaviour of integer arithmetic. However,
the output of mean() is not expected to be an integer, so I feel that this is
not a good decision.

Tamas_Papp · February 17, 2020, 5:04pm

https://docs.julialang.org/en/v1/manual/faq/#faq-integer-arithmetic-1

EDIT I see this was linked in the issue so you may have seen it; so maybe you are looking for a more detailed explanation.

It is true that the result of mean may not be an integer, but sum can be, and the overflow happens there.

In Julia, the preferred way to protect against overflow etc is to choose the right numeric type explicitly. So if you are taking the mean of integers, you can just convert to Float64, Int128, etc.

stillyslalom · February 17, 2020, 5:05pm

Julia’s built-in integers will always overflow silently for performance reasons, and mean won’t auto-convert its arguments to floating-point numbers for the same reason. If overflow safety is important to you, consider https://github.com/JeffreySarnoff/SaferIntegers.jl.

Mikhail_Kagalenko · February 17, 2020, 5:13pm

Unlike sum(), the function mean() will always convert to float even when
the result is integer. So conversion happens after summing, which sets
up a user for a nasty surprise.

EDIT: I understand the explanation, but I believe it prioritizes minuscule performance
savings over significant time that will undoubtedly be wasted by
tripping up users with this behaviour. Moreover, this misbehaviour propagates to other functions that rely on mean, like std()

Mikhail_Kagalenko · February 17, 2020, 5:14pm

Mean auto-converts its result to float even when it is an integer value, though.

stevengj · February 17, 2020, 5:43pm

I agree that since mean computes a floating-point result, it should avoid spurious overflow by doing the sum operation in floating-point arithmetic as well.

stillyslalom · February 18, 2020, 1:06am

The conversion operation (vcvtsi2sdq on my machine) is expensive compared to vaddsd, which leads to a ~6x slowdown for Int64 → Float64 conversion. I’m not sure that’s worth it when the safe behavior is so easily accessible with mean(float, v), although it’d be good to add a note of warning to the documentation.

julia> @btime mean(v) setup=(v=rand(Int64,10000));
  1.008 μs (0 allocations: 0 bytes)

julia> @btime mean(v) setup=(v=rand(Float64,10000));
  1.295 μs (0 allocations: 0 bytes)

julia> @btime mean(float, v) setup=(v=rand(Int64,10000));
  7.961 μs (3 allocations: 48 bytes)

julia> @btime mean(float, v) setup=(v=rand(Float64,10000));
  1.333 μs (3 allocations: 48 bytes)

Tamas_Papp · February 18, 2020, 9:13am

Note that in some scenarios, this would lose precision that is currently available, eg

using Statistics
a = [8315191790773418, 41632452001999347, 25329529425389991, 123030457614227249,
     80417688995052453, 41883275111226319, 85784877512947606, 120202148107616837,
     99452431972860543, 115850321730795284]
mean(Float64.(a)) - mean(a) # -16.0

as it converts an inherently associative summation problem to a tricky one with floating point.

Mikhail_Kagalenko · February 18, 2020, 10:18am

The loss of precision on some selected
values is not as unexpected to a user as silent integer overfloat
(the latter also depends on whether OS is 64 or 32 bit, another
“interesting” behaviour)

I think that your own earlier reply highlights the problem

The result of mean() as of now is never an integer, but always a float.
It may contain an integer value which is not at all the same.

The Julia intends to be the replacement for Matlab, and Matlab would
never surprise a user in this way.

>> class([intmax,intmax])

ans =

int32

>> mean([intmax,intmax])

ans =

     2.147483647000000e+09

>> intmax

ans =

  2147483647

nalimilan · February 18, 2020, 10:36am

It’s hard to know in advance what will surprise users the most. At least we cannot assume they all come from MATLAB (Julia’s goal is not just to be a MATLAB replacement).

One could almost say that an obviously incorrect result can sometimes be better than a slightly incorrect one, as the former will more obviously be noticed. Suprising people with a visible problem isn’t as bad as silently misleading them.

Anyway, the performance issue raised by @stillyslalom is serious and needs to be addressed before considering changing the behavior. A 6× penalty is not the kind of hit we usually accept in Julia for increased safety.

Tamas_Papp · February 18, 2020, 10:39am

To be fair, Matlab has a whole lot of other “surprises” to make up for this. Arguments to copy what Matlab does do not always achieve the desired effect on this forum, but YMMV.

FWIW, I find the whole “surprise” line of reasoning specious: people have heterogeneous expectations, and someone is always going to be surprised about something. This is not a solid basis for API design.

I would argue for consistency instead. In Julia, Base, the standard libraries, and most packages rarely convert Int to Float64 to just to avoid overflow, leaving this choice to the user instead.

The basic problem here is calling mean on integers, which is a very peculiar use case when you think about it. If you are working with eg measurement data that has large values, they are very rarely integers in practice.

Mikhail_Kagalenko · February 18, 2020, 1:28pm

The case of mean() is quite different, because it always converts to float. Consistency would seem to demands that it does the conversion before sum().

Mikhail_Kagalenko · February 18, 2020, 1:30pm

That argument prioritizes performance over correctness.

Mikhail_Kagalenko · February 18, 2020, 1:37pm

What is the OS and CPU on your machine?

How representative are those results for other processors and OS?

Per · February 18, 2020, 2:47pm

It’s possible to both keep full accuracy and avoid overflow with:

mean(a::Vector{Int64}) = reduce(+, a, init=zero(Int128))/length(a)

This also seems to be faster than converting each element to Float64 in my (simplistic) benchmarking.

nalimilan · February 18, 2020, 2:56pm

Yes, but that’s precisely the kind of tradeoff that Julia makes when it’s really worth it (the case in point being precisely integer addition). A 6× performance gain is probably considered as “worth it”.

OS shouldn’t be irrelevant, but FWIW I see roughly the same result on a three-year-old Intel Core i5 (Skylake).

Per:

It’s possible to both keep full accuracy and avoid overflow with:
mean(a::Vector{Int64}) = reduce(+, a, init=zero(Int128))/length(a)
This also seems to be faster than converting each element to Float64 in my (simplistic) benchmarking.

That’s an interesting solution! Now the problem with that approach is that it doesn’t work for types larger than Int, which would have to use another approach. BigInt can always be used, but it will be quite slow. Overall this makes either the result or performance harder to predict than a simple rule like overflowing (current approach) or accumulating in the type of the result (proposed approach in this thread).

Mikhail_Kagalenko · February 18, 2020, 2:58pm

Integer addition is not a good parallel, because the integer overflow is not unexpected there and there is no logical alternative to it. That is why I raise the issue with respect to mean() rather than sum().

Mikhail_Kagalenko · February 18, 2020, 3:03pm

That solution seems to me a good compromise between performance and correctness.

Mikhail_Kagalenko · February 18, 2020, 3:05pm

That’s not correct. Outputs of analog-to-digital converters are integer, and they are the most typical measurement data.

braamvandyk · February 18, 2020, 3:52pm

Horses for courses. I deal with thousands of different measurements every day in my work. Temperatures, pressures, flows, compositions, reaction rates etc. In more than 20 years I have yet to find one of them that was an integer.

Now, obviously for you the reality is different, but be careful when you claim universal truths.

Topic		Replies	Views
Inconsistent behavior of `sum`,`mean` (and probably others) on different collection types Internals & Design	31	3531	September 19, 2017
A plea for int overflow checking as the default Internals & Design question , proposal , integer-overflow	80	11759	December 17, 2021
Discussion about integer overflow Internals & Design numbers , integer-overflow	189	7666	October 15, 2021
Potential solution to the overflow problem; 64-bit considered harmful, 32- or 21-bit better General Usage integer-overflow	6	3639	October 18, 2021
Julia messes up integer exponents? Performance integer-overflow	106	6462	February 21, 2019

Mean of integers overflows - bug or expected behaviour?

Related topics