Mean of integers overflows - bug or expected behaviour?

The function mean(), when used on Int inputs, returns a float result.
However, it may silently overflow on integer arithmetic.

julia> mean([1,1]*typemax(Int))
-1

Maintainer of Statistics.jl closed
the issue, saying that this is an expected behaviour of integer arithmetic. However,
the output of mean() is not expected to be an integer, so I feel that this is
not a good decision.

https://docs.julialang.org/en/v1/manual/faq/#faq-integer-arithmetic-1

EDIT I see this was linked in the issue so you may have seen it; so maybe you are looking for a more detailed explanation.

It is true that the result of mean may not be an integer, but sum can be, and the overflow happens there.

In Julia, the preferred way to protect against overflow etc is to choose the right numeric type explicitly. So if you are taking the mean of integers, you can just convert to Float64, Int128, etc.

Julia’s built-in integers will always overflow silently for performance reasons, and mean won’t auto-convert its arguments to floating-point numbers for the same reason. If overflow safety is important to you, consider https://github.com/JeffreySarnoff/SaferIntegers.jl.

Unlike sum(), the function mean() will always convert to float even when
the result is integer. So conversion happens after summing, which sets
up a user for a nasty surprise.

EDIT: I understand the explanation, but I believe it prioritizes minuscule performance
savings over significant time that will undoubtedly be wasted by
tripping up users with this behaviour. Moreover, this misbehaviour propagates to other functions that rely on mean, like std()

Mean auto-converts its result to float even when it is an integer value, though.

I agree that since mean computes a floating-point result, it should avoid spurious overflow by doing the sum operation in floating-point arithmetic as well.

14 Likes

The conversion operation (vcvtsi2sdq on my machine) is expensive compared to vaddsd, which leads to a ~6x slowdown for Int64 → Float64 conversion. I’m not sure that’s worth it when the safe behavior is so easily accessible with mean(float, v), although it’d be good to add a note of warning to the documentation.

julia> @btime mean(v) setup=(v=rand(Int64,10000));
  1.008 ÎĽs (0 allocations: 0 bytes)

julia> @btime mean(v) setup=(v=rand(Float64,10000));
  1.295 ÎĽs (0 allocations: 0 bytes)

julia> @btime mean(float, v) setup=(v=rand(Int64,10000));
  7.961 ÎĽs (3 allocations: 48 bytes)

julia> @btime mean(float, v) setup=(v=rand(Float64,10000));
  1.333 ÎĽs (3 allocations: 48 bytes)
3 Likes

Note that in some scenarios, this would lose precision that is currently available, eg

using Statistics
a = [8315191790773418, 41632452001999347, 25329529425389991, 123030457614227249,
     80417688995052453, 41883275111226319, 85784877512947606, 120202148107616837,
     99452431972860543, 115850321730795284]
mean(Float64.(a)) - mean(a) # -16.0

as it converts an inherently associative summation problem to a tricky one with floating point.

3 Likes

The loss of precision on some selected
values is not as unexpected to a user as silent integer overfloat
(the latter also depends on whether OS is 64 or 32 bit, another
“interesting” behaviour)

I think that your own earlier reply highlights the problem

The result of mean() as of now is never an integer, but always a float.
It may contain an integer value which is not at all the same.

The Julia intends to be the replacement for Matlab, and Matlab would
never surprise a user in this way.

>> class([intmax,intmax])

ans =

int32

>> mean([intmax,intmax])

ans =

     2.147483647000000e+09

>> intmax

ans =

  2147483647

It’s hard to know in advance what will surprise users the most. At least we cannot assume they all come from MATLAB (Julia’s goal is not just to be a MATLAB replacement).

One could almost say that an obviously incorrect result can sometimes be better than a slightly incorrect one, as the former will more obviously be noticed. Suprising people with a visible problem isn’t as bad as silently misleading them.

Anyway, the performance issue raised by @stillyslalom is serious and needs to be addressed before considering changing the behavior. A 6Ă— penalty is not the kind of hit we usually accept in Julia for increased safety.

10 Likes

To be fair, Matlab has a whole lot of other “surprises” to make up for this. Arguments to copy what Matlab does do not always achieve the desired effect on this forum, but YMMV.

FWIW, I find the whole “surprise” line of reasoning specious: people have heterogeneous expectations, and someone is always going to be surprised about something. This is not a solid basis for API design.

I would argue for consistency instead. In Julia, Base, the standard libraries, and most packages rarely convert Int to Float64 to just to avoid overflow, leaving this choice to the user instead.

The basic problem here is calling mean on integers, which is a very peculiar use case when you think about it. If you are working with eg measurement data that has large values, they are very rarely integers in practice.

8 Likes

The case of mean() is quite different, because it always converts to float. Consistency would seem to demands that it does the conversion before sum().

1 Like

That argument prioritizes performance over correctness.

2 Likes

What is the OS and CPU on your machine?

How representative are those results for other processors and OS?

It’s possible to both keep full accuracy and avoid overflow with:

mean(a::Vector{Int64}) = reduce(+, a, init=zero(Int128))/length(a)

This also seems to be faster than converting each element to Float64 in my (simplistic) benchmarking.

2 Likes

Yes, but that’s precisely the kind of tradeoff that Julia makes when it’s really worth it (the case in point being precisely integer addition). A 6× performance gain is probably considered as “worth it”.

OS shouldn’t be irrelevant, but FWIW I see roughly the same result on a three-year-old Intel Core i5 (Skylake).

That’s an interesting solution! Now the problem with that approach is that it doesn’t work for types larger than Int, which would have to use another approach. BigInt can always be used, but it will be quite slow. Overall this makes either the result or performance harder to predict than a simple rule like overflowing (current approach) or accumulating in the type of the result (proposed approach in this thread).

3 Likes

Integer addition is not a good parallel, because the integer overflow is not unexpected there and there is no logical alternative to it. That is why I raise the issue with respect to mean() rather than sum().

1 Like

That solution seems to me a good compromise between performance and correctness.

That’s not correct. Outputs of analog-to-digital converters are integer, and they are the most typical measurement data.

2 Likes

Horses for courses. I deal with thousands of different measurements every day in my work. Temperatures, pressures, flows, compositions, reaction rates etc. In more than 20 years I have yet to find one of them that was an integer.

Now, obviously for you the reality is different, but be careful when you claim universal truths.

4 Likes