Mean of integers overflows - bug or expected behaviour?

When you have a mixed-type array, there is some ambiguity about the desired output type (which will not be type-stable in any case). It’s true that my code converts to floating point more eagerly for Any arrays, but this seems like a defensible choice to me.

The basic principle is to do the accumulation using the output type.
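As a sketch of what that principle means in code (the name mean_outtype is hypothetical, not the actual Statistics.jl implementation):

# Hypothetical sketch: accumulate in the output type, not the input type.
function mean_outtype(x)
    T = typeof(one(eltype(x)) / 1)  # output type of the division: Float64 for Int, BigFloat for BigInt
    s = zero(T)
    for xi in x
        s += T(xi)  # convert each element before adding, so an Int input cannot overflow the accumulator
    end
    return s / length(x)
end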

What if the input is BigInt? Would not this lead to a terrible loss of accuracy?

If you compute a function with a floating point result, you should expect floating point accuracy characteristics.

If there are number theorists who need exact statistical results, they can do sum(x)//length(x) to get a Rational{BigInt}. But I’m skeptical that this is a large group of people.
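For example:

julia> x = BigInt[1, 2, 2];

julia> sum(x) // length(x)
5//3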

In other words, you prefer to ignore the existence of BigInts. Would it not be more reasonable to dispatch on the type and use a different algorithm in that case?

To be fair, the same may apply to the group of people who want to use Int for measurement data.

In general, I understand that people mean well, but trying to anticipate user expectations gives diminishing returns after a while because different people will just have different expectations (because of their application, field, or previous experience with programming languages).

So we end up with these very long discussions which are basically about conflicting expectations and preferences. I guess most participants recognize that there is some validity to all expectations. It’s just that they have different weights on them. Lacking a representative sample about Julia users, we can’t even target some “social optimum” (even if that made sense conceptually).

I would instead argue that we

  1. design building blocks that have a simple mental model the user can learn and reason about. Yes, it will not be “correct” 100% of the time, but if these building blocks are sufficiently simple, orthogonal, and not too numerous, the users will just learn about them,

  2. clarify to new users that this is how Julia is intended to work, because this approach meshes very well with multiple dispatch and the inherent composability offered by the language.

For this particular case, I think that the status quo fulfills these requirements: the three principles that need to be understood are

  1. + and how it behaves for Int, Float64, etc,

  2. sum is mostly a reduction with + (cf Base.add_sum),

  3. mean(x) = sum(x) / length(x) (or near-equivalent when that makes sense).
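For example, once you know that Int addition wraps around on overflow (principle 1), the status-quo behavior of mean follows mechanically from principles 2 and 3:

julia> x = [typemax(Int), typemax(Int)];

julia> sum(x)               # Int + wraps around
-2

julia> sum(x) / length(x)   # mean is just this division (status quo)
-1.0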

Then instead of the language anticipating user expectations, the user can easily tailor the application to a particular expectation, harnessing the power of the language and common packages, eg

  1. mean(x) — I want it to be fast and I understand the implications (?sum talks about overflow)

  2. mean(float, x) — makes sure floating point arithmetic is done, may not be the highest precision though

  3. using KahanSummation; sum_kbn(Float64.(x)) / length(x) — I want the highest precision and I am willing to pay for it in performance

  4. mean(BigInt, x) — I use large integers, you insensitive clod

  5. mean(BigFloat, x) — ditto
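For instance, options 1 and 2 differ as soon as the Int sum wraps around (under the status-quo behavior discussed in this thread):

julia> using Statistics

julia> x = fill(2^62, 4);

julia> mean(x)          # option 1: Int accumulation; 4 * 2^62 wraps to 0
0.0

julia> mean(float, x)   # option 2: Float64 accumulation
4.611686018427388e18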

5 Likes

No. If you want exact results with BigInt, you shouldn’t be using a function that computes a floating-point result.

This has nothing to do with multiple dispatch. When you have a function with inputs of type T and outputs of type S, there is nothing inherent to Julia about computing the results with the precision of T rather than S.

(Note that mean with a dims argument already computes using the precision of the output type.)
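Assuming the dims method behaves as described, one can see the discrepancy directly with something like:

using Statistics
mean([typemax(Int) typemax(Int)]; dims=2)  # output-precision accumulation: expect ≈ 9.22e18, not the -1.0 that the vector method gives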

5 Likes

I was talking about the composition of building blocks with simple behavior, which I think does have to do with multiple dispatch.

Specifically, here the behavior of mean follows from sum.

That’s an interesting inconsistency, which I would call a bug. FWIW, I think that having versions of functions with a dims argument is a kludge from the early days of Julia, which I hope will be replaced by just using eachslice etc one day.

As I said above, this is a circular argument: you’re saying that mean is implemented the way it is because it is “defined” by its current implementation.

The whole question here is in which number system mean should be computed: that of its input or its output.

4 Likes

I think you misunderstand my point: the argument is not specifically that mean should be defined based on sum, but that it is advantageous when we can simplify the semantics to simple rules. Here, the point is that if I know how sum works, I have a good understanding of mean almost automatically without learning anything extra.

In this case basing mean on sum is a natural choice because doing it the other way round would introduce numerical error in most cases, but generally the important point is unifying semantics.

Computing the sum in the precision of the output (i.e. floating point) would not introduce numerical error in “most cases”. It gives the same result for integers \lesssim 2^{53} (where Float64 is exact), whereas if you have integers larger than that you are quite likely to overflow Int. So, I would say that for most cases it has equal or better accuracy than Int summation, and the potential inaccuracies are much smaller.
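The 2^{53} boundary is just the width of the Float64 significand: every integer up to 2^{53} is exactly representable, and 2^{53} + 1 is the first that is not:

julia> Float64(2^53) == 2^53
true

julia> Float64(2^53 + 1) == 2^53 + 1   # rounds to 2^53
false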

Since every subsequent statistical calculation following mean will use floating-point arithmetic, doing the sum using the input arithmetic leads to completely unnatural results, for example:

julia> var([typemax(Int), typemax(Int)])
1.7014118346046923e38

While you could claim that this is consistent with our “definition” (i.e. implementation) of mean, I think it is hard to argue that the variance of identical inputs should not be zero or nearly so (up to roundoff error — i.e. it should at least be small compared to norm(input)^2).

In summary, doing mean with the sum using the input precision (i.e. the current behavior) has the following pros and cons compared to using the output precision (the proposed behavior). Con:

  • Much more susceptible to catastrophic overflow—which spills over into subsequent calculations like var and std—when using hardware integer types, especially in 32-bit Julia.

Pros:

  • Faster for Int (and IntXX) inputs.
  • Slightly more accurate for some inputs \gtrsim 2^{53} (the improvement is small compared to maximum(abs, input), and the output is not generally exact anyway since it is still floating-point in the end). I say “some” because if you have such large inputs and are using Int then you are likely to hit overflow.
    • Even for BigInt inputs (where the output is a BigFloat in the current precision), the improvement in accuracy is small compared to maximum(abs, input) — if you need an exact result, you have to switch to rational calculations anyway.
  • Implementation is easier.

To me, robustness seems vastly preferable to speed here, especially since in most real applications the speed of mean is unlikely to be limiting, and the claimed accuracy advantages of Int are a mirage.

11 Likes

The big advantage of using Int instead of Float64 is that for values between 2^53 and 2^60, an Int sum will often not overflow unless you have a very large list, whereas Float64 is already inexact there. While this may not seem like a lot, it means you get exact behavior over a range of values 128x larger.

julia> div(typemax(Int64), 2^53)
1023

I would not say that 1000 numbers is “a very large list”. Doing statistical calculations with Int values that large is an overflow waiting to happen.

1 Like

Perhaps I did not frame my argument clearly: I was arguing for consistency between sum and mean, not accuracy (in the mathematical sense). I recognize that what you are proposing is more accurate (in the sense of how closely the computed result approximates the exact real-number value) in most cases.

I was also arguing for consistency in API design. That is, if all implementations of mean are expected to make a reasonable effort to protect from overflow and this becomes part of the API, then it is a useful feature because the caller can depend on it. If this is not to be expected in general, then I cannot write generic code like mean(x) without thinking about what x is, so it is not nearly as useful if it only works in specific cases.
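As a toy illustration of the generic-code point (standardize here is a hypothetical helper, not from any package):

using Statistics

# Whether this generic function is safe depends entirely on whether
# mean and std protect against overflow for the caller's element type;
# with Int data and overflowing accumulation it can silently return garbage.
standardize(x) = (x .- mean(x)) ./ std(x)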

(I also realized that if this change goes through, I can always get the current behavior back with sum(x) / length(x) in case I need it, so there is no loss.)

As is usually the case, I think this is an interesting discussion.

My first reaction was, “Take mean out of Base!” so I started out happy to see that mean is not in Base and we are talking about a package, Statistics.jl, which makes things easier to resolve.

I am on board with @stevengj here. He makes two eloquent points that are difficult to argue against:

  1. The relevant comparison here is with norm. If we want a fast mean with possibility of overflow, then for consistency, we should have a fast norm with possibility of overflow. So either make norm fast or make mean slow. Mixing the two seems inconsistent philosophically.
  2. The accuracy / robustness should be consistent with the output rather than the input.

The good news is this is a package (albeit a standard library) so, in principle, you can fix this with some global setting, for example STATISTICS_UNSAFE = true, or something, and make the defaults safe and perhaps a bit slower with an option to make them unsafe and fast. Maybe someone does want a fast norm and is not concerned about overflow.

3 Likes

sum of an integer array has an integer output. mean of an integer array has a floating-point output. As a result, I don’t think there is any consistency requirement that they compute using the same arithmetic type.

And yes, I think “not wildly inaccurate given the output precision” (or more technically: numerically stable, with a forward error proportional to the output precision and condition number) should be an implicit part of the API for almost any function with a floating-point result (unless it is explicitly documented as numerically unstable, caveat emptor), and deviations from that expectation should generally be viewed as a bug.

They can just do sqrt(sum(abs2, x)) then; no need for a package setting. Similarly, if someone wants an unsafe mean of integer inputs they can do sum(x)/length(x) as @Tamas_Papp suggests. You use libraries for convenience, but also because libraries handle corner cases and other pitfalls that a naive implementation might stumble into — a library is precisely the place for a nontrivial robust algorithm.
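The norm trade-off is easy to demonstrate: the library version rescales internally, while the naive formula overflows in the intermediate squares:

using LinearAlgebra

x = [1e300, 1e300]
norm(x)              # ≈ 1.414e300: norm rescales, no overflow
sqrt(sum(abs2, x))   # Inf: abs2(1e300) = 1e600 overflows Float64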

12 Likes