# Mean of integers overflows - bug or expected behaviour?

The function mean(), when used on Int inputs, returns a float result.
However, it may silently overflow in the intermediate integer arithmetic.

```julia
julia> mean([1,1]*typemax(Int))
-1.0
```
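The division happens only after the integer sum has already wrapped around; a minimal sketch of the mechanics (assuming a 64-bit `Int`):

```julia
using Statistics

v = [typemax(Int), typemax(Int)]  # two copies of 2^63 - 1

s = sum(v)   # wraps around: (2^63 - 1) + (2^63 - 1) ≡ -2 (mod 2^64)
m = mean(v)  # computed as sum(v) / length(v), i.e. -2 / 2 == -1.0
```

So the float conversion arrives too late to help: the damage is done inside `sum`.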

The maintainer of Statistics.jl closed
the issue, saying that this is expected behaviour of integer arithmetic. However,
the output of mean() is not expected to be an integer, so I feel that this is
not a good decision.

https://docs.julialang.org/en/v1/manual/faq/#faq-integer-arithmetic-1

EDIT: I see this was linked in the issue, so you may have seen it; maybe you are looking for a more detailed explanation.

It is true that the result of `mean` may not be an integer, but `sum` can be, and the overflow happens there.

In Julia, the preferred way to protect against overflow etc is to choose the right numeric type explicitly. So if you are taking the mean of integers, you can just convert to `Float64`, `Int128`, etc.
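For example (a sketch with a deliberately overflowing input):

```julia
using Statistics

v = [typemax(Int), typemax(Int)]

mean(v)            # -1.0: the Int sum wraps around
mean(Float64.(v))  # 9.223372036854776e18: sums in Float64
mean(Int128.(v))   # 9.223372036854776e18: sums exactly in Int128
```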

Julia's built-in integers will always overflow silently for performance reasons, and `mean` won't auto-convert its arguments to floating-point numbers for the same reason. If overflow safety is important to you, consider https://github.com/JeffreySarnoff/SaferIntegers.jl.
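As a sketch of that last option (assuming SaferIntegers' exported `SafeInt` type, whose arithmetic throws an `OverflowError` instead of wrapping):

```julia
using SaferIntegers, Statistics

v = SafeInt.([typemax(Int), typemax(Int)])

# The intermediate sum overflows, so instead of silently returning -1.0
# this raises an error the caller can handle:
try
    mean(v)
catch err
    err isa OverflowError && @warn "overflow while computing mean"
end
```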

Unlike sum(), the function mean() always converts its result to float, even
when the value is a whole number. So the conversion happens after the
summation, which sets users up for a nasty surprise.

EDIT: I understand the explanation, but I believe it prioritizes minuscule performance
savings over the significant time that will undoubtedly be wasted
tripping up users with this behaviour. Moreover, this misbehaviour propagates to other functions that rely on mean, like std().
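For example (a sketch): the standard deviation of two identical values should be exactly zero, but the overflowed mean contaminates the result:

```julia
using Statistics

v = [typemax(Int), typemax(Int)]

std(Float64.(v))  # 0.0, as expected for identical values
std(v)            # a huge positive number, because every deviation is
                  # measured from the overflowed mean of -1.0
```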

Mean auto-converts its result to float even when it is an integer value, though.

I agree that since `mean` computes a floating-point result, it should avoid spurious overflow by doing the sum operation in floating-point arithmetic as well.

14 Likes

The conversion operation (`vcvtsi2sdq` on my machine) is expensive compared to `vaddsd`, which leads to a ~6x slowdown for `Int64` -> `Float64` conversion. I'm not sure that's worth it when the safe behavior is so easily accessible with `mean(float, v)`, although it'd be good to add a note of warning to the documentation.

```julia
julia> @btime mean(v) setup=(v=rand(Int64,10000));
  1.008 μs (0 allocations: 0 bytes)

julia> @btime mean(v) setup=(v=rand(Float64,10000));
  1.295 μs (0 allocations: 0 bytes)

julia> @btime mean(float, v) setup=(v=rand(Int64,10000));
  7.961 μs (3 allocations: 48 bytes)

julia> @btime mean(float, v) setup=(v=rand(Float64,10000));
  1.333 μs (3 allocations: 48 bytes)
```
2 Likes

Note that in some scenarios, this would lose precision that is currently available, eg

```julia
using Statistics
a = [8315191790773418, 41632452001999347, 25329529425389991, 123030457614227249,
80417688995052453, 41883275111226319, 85784877512947606, 120202148107616837,
99452431972860543, 115850321730795284]
mean(Float64.(a)) - mean(a) # -16.0
```

as it converts an inherently associative summation problem into a tricky floating-point one.

2 Likes

The loss of precision on some selected
values is not as unexpected to a user as silent integer overflow
(the latter also depends on whether the OS is 64- or 32-bit, another
"interesting" behaviour).

The result of mean() as of now is never an integer, but always a float.
It may hold an integer value, which is not at all the same thing.

Julia intends to be a replacement for Matlab, and Matlab would
never surprise a user in this way.

```matlab
>> class([intmax,intmax])

ans =

int32

>> mean([intmax,intmax])

ans =

2.147483647000000e+09

>> intmax

ans =

2147483647
```

It's hard to know in advance what will surprise users the most. At least we cannot assume they all come from MATLAB (Julia's goal is not just to be a MATLAB replacement).

One could almost say that an obviously incorrect result can sometimes be better than a slightly incorrect one, as the former is more likely to be noticed. Surprising people with a visible problem isn't as bad as silently misleading them.

Anyway, the performance issue raised by @stillyslalom is serious and needs to be addressed before considering changing the behavior. A 6× penalty is not the kind of hit we usually accept in Julia for increased safety.

9 Likes

To be fair, Matlab has a whole lot of other "surprises" to make up for this. Arguments to copy what Matlab does do not always achieve the desired effect on this forum, but YMMV.

FWIW, I find the whole "surprise" line of reasoning specious: people have heterogeneous expectations, and someone is always going to be surprised about something. This is not a solid basis for API design.

I would argue for consistency instead. In Julia, `Base`, the standard libraries, and most packages rarely convert `Int` to `Float64` just to avoid overflow, leaving this choice to the user instead.

The basic problem here is calling `mean` on integers, which is a very peculiar use case when you think about it. If you are working with eg measurement data that has large values, they are very rarely integers in practice.

7 Likes

The case of mean() is quite different, because it always converts to float. Consistency would seem to demand that it do the conversion before sum().

That argument prioritizes performance over correctness.

1 Like

What is the OS and CPU on your machine?

How representative are those results for other processors and OS?

It's possible to both keep full accuracy and avoid overflow with:

```julia
mean(a::Vector{Int64}) = reduce(+, a, init=zero(Int128))/length(a)
```

This also seems to be faster than converting each element to `Float64` in my (simplistic) benchmarking.

1 Like

Yes, but that's precisely the kind of tradeoff that Julia makes when it's really worth it (the case in point being precisely integer addition). A 6× performance gain is probably considered "worth it".

The OS should be irrelevant, but FWIW I see roughly the same result on a three-year-old Intel Core i5 (Skylake).

That's an interesting solution! Now the problem with that approach is that it doesn't work for types larger than `Int`, which would have to use another approach. `BigInt` can always be used, but it will be quite slow. Overall this makes either the result or the performance harder to predict than a simple rule like overflowing (current approach) or accumulating in the type of the result (proposed approach in this thread).
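A generic variant of that idea (a sketch, not Statistics.jl's actual implementation) can pick the accumulator with `widen`, which maps `Int64` to `Int128` and `Int128` to `BigInt`:

```julia
using Statistics

# Hypothetical helper: accumulate in the next-wider integer type,
# dividing only once at the end.
widemean(a::AbstractVector{<:Integer}) =
    reduce(+, a; init = zero(widen(eltype(a)))) / length(a)

widemean([typemax(Int), typemax(Int)])  # 9.223372036854776e18, no overflow
```

The `BigInt` fallback for `Int128` inputs keeps the result exact, but at a much higher and less predictable cost.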

2 Likes

Integer addition is not a good parallel, because integer overflow is not unexpected there and there is no logical alternative to it. That is why I raised the issue with respect to mean() rather than sum().

That solution seems to me a good compromise between performance and correctness.

That's not correct. Outputs of analog-to-digital converters are integers, and they are the most typical measurement data.

1 Like

Horses for courses. I deal with thousands of different measurements every day in my work. Temperatures, pressures, flows, compositions, reaction rates etc. In more than 20 years I have yet to find one of them that was an integer.

Now, obviously for you the reality is different, but be careful when you claim universal truths.

4 Likes