When you have a mixed-type array, there is some ambiguity about the desired output type (which will not be type-stable in any case). It’s true that my code converts to floating point more eagerly for Any arrays, but this seems like a defensible choice to me.
The basic principle is to do the accumulation using the output type.
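A minimal sketch of that principle (hypothetical code, not the actual Statistics.jl implementation): since `mean` of integers returns a float, sum into a float accumulator instead of the input’s integer type.

```julia
# Accumulate in the output type: each addition promotes to T,
# so hardware-integer overflow cannot occur in the accumulator.
function mean_outprec(x)
    T = float(eltype(x))   # the eventual output type, e.g. Int -> Float64
    s = zero(T)
    for xi in x
        s += xi            # promotes xi to T before adding
    end
    return s / length(x)
end
```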
What if the input is `BigInt`? Would this not lead to a terrible loss of accuracy?
If you compute a function with a floating point result, you should expect floating point accuracy characteristics.
If there are number theorists who need exact statistical results, they can do `sum(x) // length(x)` to get a `Rational{BigInt}`. But I’m skeptical that this is a large group of people.
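For concreteness, that exact-rational route looks like this:

```julia
julia> x = big.([1, 2, 2]);    # Vector{BigInt}

julia> sum(x) // length(x)     # exact mean as a Rational{BigInt}
5//3
```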
In other words, you prefer to ignore the existence of `BigInt`s. Would it not be more reasonable to dispatch on types and use a different algorithm in that case?
To be fair, the same may apply to the group of people who want to use `Int` for measurement data.
In general, I understand that people mean well, but trying to anticipate user expectations gives diminishing returns after a while because different people will just have different expectations (because of their application, field, or previous experience with programming languages).
So we end up with these very long discussions which are basically about conflicting expectations and preferences. I guess most participants recognize that there is some validity to all expectations. It’s just that they have different weights on them. Lacking a representative sample about Julia users, we can’t even target some “social optimum” (even if that made sense conceptually).
I would instead argue that we

- design building blocks that have a simple mental model the user can learn and reason about. Yes, it will not be “correct” 100% of the time, but if these building blocks are sufficiently simple, orthogonal, and not too numerous, the users will just learn about them,
- clarify to new users that this is how Julia is intended to work, because this approach meshes very well with multiple dispatch and the inherent composability offered by the language.
For this particular case, I think that the status quo fulfills these requirements: the three principles that need to be understood are

- `+` and how it behaves for `Int`, `Float64`, etc,
- `sum` is mostly a reduction with `+` (cf `Base.add_sum`; see the snippet after this list),
- `mean(x) = sum(x) / length(x)` (or near-equivalent when that makes sense).
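The “mostly” in the `Base.add_sum` point is visible for small integer types, where `sum` widens but bare `+` wraps:

```julia
julia> Int8(100) + Int8(100)   # + wraps around in Int8 arithmetic
-56

julia> sum(Int8[100, 100])     # sum reduces with Base.add_sum, which widens to Int
200
```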
Then instead of the language anticipating user expectations, the user can easily tailor the application to a particular expectation, harnessing the power of the language and common packages, eg

- `mean(x)` — I want it to be fast and I understand the implications (`?sum` talks about overflow)
- `mean(float, x)` — makes sure floating point arithmetic is done, may not be the highest precision though
- `using KahanSummation; sum_kbn(Float64.(x)) / length(x)` — I want the highest precision and I am willing to pay for it in performance
- `mean(BigInt, x)` — I use large integers, you insensitive clod
- `mean(BigFloat, x)` — ditto
No. If you want exact results with `BigInt`, you shouldn’t be using a function that computes a floating-point result.
This has nothing to do with multiple dispatch. When you have a function with inputs of type `T` and outputs of type `S`, there is nothing inherent to Julia about computing the results with the precision of `T` rather than `S`.
(Note that `mean` with a `dims` argument already computes using the precision of the output type.)
I was talking about the composition of building blocks with simple behavior, which I think the status quo achieves. Specifically, here the behavior of `mean` follows from `sum`.
That’s an interesting inconsistency, which I would call a bug. FWIW, I think that having versions of functions with a `dims` argument is a kludge from the early days of Julia, which I hope will be replaced by just using `eachslice` etc. one day.
As I said above, this is a circular argument: you’re saying that `mean` is implemented the way it is because it is “defined” by its current implementation. The whole question here is in which number system `mean` should be computed: that of its input or its output.
I think you misunderstand my point: the argument is not specifically that `mean` should be defined based on `sum`, but that it is advantageous when we can simplify the semantics to simple rules. Here, the point is that if I know how `sum` works, I have a good understanding of `mean` almost automatically, without learning anything extra. In this case basing `mean` on `sum` is a natural choice, because doing it the other way round would introduce numerical error in most cases, but generally the important point is unifying semantics.
Computing the sum in the precision of the output (i.e. floating point) would not introduce numerical error in “most cases”. It gives the same result for integers $\lesssim 2^{53}$ (where `Float64` is exact), whereas if you have integers larger than that you are quite likely to overflow `Int`. So, I would say that for most cases it has equal or better accuracy than `Int` summation, and the potential inaccuracies are much smaller.
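Both halves of that claim are easy to check in the REPL:

```julia
julia> Float64(2^53) == 2^53          # integers up to 2^53 are exact in Float64
true

julia> Float64(2^53 + 1) == 2^53 + 1  # the first integer that is not
false

julia> sum(fill(2^62, 4))             # Int accumulation wraps: 4 * 2^62 == 2^64 ≡ 0
0

julia> sum(fill(Float64(2^62), 4))    # Float64 accumulation does not overflow
1.8446744073709552e19
```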
Since every subsequent statistical calculation following `mean` will use floating-point arithmetic, doing the `sum` using the input arithmetic leads to completely unnatural results, for example:

```julia
julia> var([typemax(Int), typemax(Int)])
1.7014118346046923e38
```
While you could claim that this is consistent with our “definition” (i.e. implementation) of `mean`, I think it is hard to argue that the variance of identical inputs should not be zero or nearly so (up to roundoff error — i.e. it should at least be small compared to `norm(input)^2`).
In summary, doing `mean` with the `sum` using the input precision (i.e. the current behavior) has the following pros and cons compared to using the output precision (the proposed behavior).

Con:

- Much more susceptible to catastrophic overflow—which spills over into subsequent calculations like `var` and `std`—when using hardware integer types, especially in 32-bit Julia.

Pros:

- Faster for `Int` (and `IntXX`) inputs.
- Slightly more accurate for some inputs $\gtrsim 2^{53}$ (the improvement is small compared to `maximum(abs, input)`, and the output is not generally exact anyway since it is still floating-point in the end). I say “some” because if you have such large inputs and are using `Int` then you are likely to hit overflow.
- Even for `BigInt` inputs (where the output is a `BigFloat` in the current precision), the improvement in accuracy is small compared to `maximum(abs, input)` — if you need an exact result, you have to switch to rational calculations anyway.
- Implementation is easier.
To me, robustness seems vastly preferable to speed here, especially since in most real applications the speed of `mean` is unlikely to be limiting, and the claimed accuracy advantages of `Int` are a mirage.
The big advantage of using `Int` instead of `Float64` is that for values between $2^{53}$ and $2^{60}$, `Int` will often not overflow unless you have a very large list. While this may not seem like a lot, that means you get accurate behavior for 32x more numbers.

```julia
julia> div(typemax(Int64), 2^53)
1023
```
I would not say that 1000 numbers is “a very large list”. Doing statistical calculations with `Int` values that large is an overflow waiting to happen.
Perhaps I did not frame my argument clearly — I was arguing for consistency between `sum` and `mean`, not accuracy (in the mathematical sense). I recognize that what you are proposing is more accurate (if we consider computer representations of numbers as real numbers) in most cases.
I was also arguing for consistency in API design. That is, if all implementations of `mean` are expected to make a reasonable effort to protect from overflow and this becomes part of the API, then it is a useful feature, because the caller can depend on it. If this is not to be expected generally, then I cannot write generic code like `mean(x)` without thinking about what `x` is, so it is not nearly as useful if it only works for a specific case.
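As a toy example of the kind of generic code in question (a hypothetical helper, assuming `Statistics` is loaded):

```julia
using Statistics

# Hypothetical generic helper: it is only correct for integer eltypes if
# `mean` and `std` themselves guard against intermediate overflow.
standardize(x) = (x .- mean(x)) ./ std(x)
```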
(I also realized that if this change goes through, I can always get the current behavior back with `sum(x) / length(x)` in case I need it, so there is no loss.)
As is usually the case, I think this is an interesting discussion.
My first reaction was, “Take `mean` out of `Base`!”, so I started out happy to see that `mean` is not in `Base` and we are talking about a package, Statistics.jl, which makes things easier to resolve.
I am on board with @stevengj here. He makes two eloquent points that are difficult to argue against:

- The relevant comparison here is with `norm`. If we want a fast `mean` with possibility of overflow, then for consistency we should have a fast `norm` with possibility of overflow. So either make `norm` fast or make `mean` slow. Mixing the two seems inconsistent philosophically.
- The accuracy / robustness should be consistent with the output rather than the input.
The good news is this is a package (albeit a standard library) so, in principle, you can fix this with some global setting, for example `STATISTICS_UNSAFE = true` or something, and make the defaults safe and perhaps a bit slower, with an option to make them unsafe and fast. Maybe someone does want a fast `norm` and is not concerned about overflow.
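A rough sketch of such a toggle (entirely hypothetical; `STATISTICS_UNSAFE` is not a real Statistics.jl setting, and `safe_mean` is an invented name):

```julia
# Hypothetical global switch; a Ref so it can be flipped at runtime.
const STATISTICS_UNSAFE = Ref(false)

function safe_mean(x::AbstractArray{<:Integer})
    if STATISTICS_UNSAFE[]
        return sum(x) / length(x)         # fast path: Int accumulation, may overflow
    else
        return sum(float, x) / length(x)  # safe path: accumulate in Float64
    end
end
```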
`sum` of an integer array has an integer output. `mean` of an integer array has a floating-point output. As a result, I don’t think there is any consistency requirement that they compute using the same arithmetic type.
And yes, I think “not wildly inaccurate given the output precision” (or more technically: numerically stable, with a forward error proportional to the output precision and condition number) should be an implicit part of the API for almost any function with a floating-point result (unless it is explicitly documented as numerically unstable, caveat emptor), and deviations from that expectation should generally be viewed as a bug.
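Stated loosely (my formalization, not wording from the post): with $\epsilon_{\mathrm{out}}$ the unit roundoff of the output type, $\kappa(f, x)$ the condition number of $f$ at $x$, and $\hat f$ the computed result, the expectation is a forward error bound of roughly

$$\frac{|\hat f(x) - f(x)|}{|f(x)|} = O\big(\kappa(f, x)\,\epsilon_{\mathrm{out}}\big).$$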
They can just do `sqrt(sum(abs2, x))` then; no need for a package setting. Similarly, if someone wants an unsafe mean of integer inputs, they can do `sum(x) / length(x)` as @Tamas_Papp suggests. You use libraries for convenience, but also because libraries handle corner cases and other pitfalls that a naive implementation might stumble into — a library is precisely the place for a nontrivial robust algorithm.