Inconsistent behavior of `sum`,`mean` (and probably others) on different collection types

Note that the input type is not preserved in general. For iterables, there is a fairly complex mechanism for determining the accumulator type: for integers, it widens smaller types to Int32, and Int32 or Int64 to Int64; for non-integers, the result type becomes typeof(x + zero(x)). This is not the case for tuples, which are simply summed without widening. I would assume that is because summing tuples is not a particularly meaningful operation, but there might be a more specific reason.
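The difference can be seen with a small sketch like the following (the exact result types depend on the Julia version and platform word size, and tuple summation may have changed since this thread):

```julia
# Sketch of the inconsistency described above; result types are
# version-dependent, so the comments hedge rather than assert.
a = Int8[100, 100]
t = (Int8(100), Int8(100))

sum(a)          # accumulates in a widened type, so 200 is representable
sum(t)          # summed with plain +, so Int8 overflow is possible here
typeof(sum(a))  # a wider integer type (e.g. Int)
typeof(sum(t))  # Int8 on the versions discussed in this thread
```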

At this point, we should really consider removing automatic widening from reductions over numeric types, except perhaps Bool, which I think should not be numeric anymore anyway. If someone wants a larger result type, it’s better that they specify it; otherwise, we should just give them what the normal promotion rules would give them. In particular, for collections of same-type elements we no longer promote for ordinary summation, and I think reductions should follow suit. That said, perhaps we should do the accumulation in the result type, so mean over any integer type would accumulate in Float64 because of the division by the integer length at the end.


Why not zero(T)/one(T)?


Just in case addition returns a different type. In practice it probably won’t make a difference because of the final division, but who knows what a custom type might want to do…

Unitful.jl numbers do not carry units when using zero, so zero(T)/one(T) would get the wrong type (wrong units) in that case. (one(T)+one(T))/one(T) would be unitless, so it would probably need a correction for units.

Before everyone gets caught up in this /one(T) business, let me remind you that the divisor is, and should be, an Int. That also eliminates the unit problem.

For most types, the return type is now that of (T + T) / Int, which I think is as good as it gets.
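Concretely, the (T + T) / Int rule predicts result types like the following (a sketch; mean moved to the Statistics stdlib in Julia 0.7, it lived in Base when this thread was written):

```julia
using Statistics  # provides mean on Julia ≥ 0.7

# What the (T + T) / Int rule predicts:
typeof(mean(Float32[1, 2, 3]))  # Float32: (Float32 + Float32) / Int
typeof(mean(Int16[1, 2, 3]))    # Float64: widened integer sum / Int
```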


Yes, as I’d said earlier, all of the implementations of mean use an Int for the division (either incrementing a counter in a loop for a general iterable or the return value from length or _length).

Unfortunately, using the type of (T + T) doesn’t give a useful result for smaller integer types, and how the values are accumulated is really an implementation detail that I feel should not necessarily affect the result of mean.
Also, as a user of the function, I’d rather see that the author took care to give a useful result in as many cases as possible, instead of having to worry about it myself. I’d also like a note in the documentation warning about any cases where the user might still need to worry about overflow, along with how to resolve the issue (for example, using mean(UInt128, vec) or mean(BigInt, vec) to prevent overflow).
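A mean(::Type, vec) method like the one suggested does not exist as such, but the existing two-argument mean(f, itr) can serve the same purpose, since a type used as a function acts as a constructor (a sketch):

```julia
using Statistics

v = fill(typemax(Int64), 4)

mean(v)          # the Int64 accumulation can overflow here
mean(big, v)     # widen each element to BigInt before accumulating
mean(Int128, v)  # the Int128 constructor is applied to each element
```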

The current implementation of mean for an iterable also has a problem: the accumulator is not type stable if the values are not all of the same type.
I think using eltype on the collection and then picking an accumulation type that avoids the instability would give both better results and better performance than using typeof(value + zero(value)).

Note: I’m just trying to discuss any potential gotchas in advance, before implementing a replacement for mean.
If people prefer, discussion of an improved mean implementation can be moved to a separate topic.

One rule which is consistent:

  1. For non-empty arrays, the type of mean(x) could be required to always match that of reduce(+, x)/length(x). This means that overflow etc. happens as it would with +.
  2. For empty arrays, mean throws an error (or returns a NaN if possible).
  3. Same applies to sum, except for empty arrays we get zero(eltype(x)), or if that is not defined, an error.
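The rule above can be written down directly; strict_mean and strict_sum below are hypothetical names for illustration, not existing functions:

```julia
# Hypothetical implementations of the consistency rule above.
strict_mean(x) =
    isempty(x) ? throw(ArgumentError("mean of an empty collection")) :
    reduce(+, x) / length(x)

strict_sum(x) = isempty(x) ? zero(eltype(x)) : reduce(+, x)

strict_sum(Int8[100, 100])  # overflows exactly as Int8(100) + Int8(100) does
```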

For integer arithmetic, the values should also be equal (ie overflow the same way). For floating point, they could differ because of floating point error.

This seems somewhat harsh, as you are likely to get nonsensical results with overflow. The advantage is consistency and type-stability. There could be a sum(T, x) signature that is equivalent to reduce(+, zero(T), x) for widening on demand.
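The widening-on-demand signature could look like the sketch below; widesum is a hypothetical name (sum(::Type, x) is not an existing method), and on Julia ≥ 1.0 the initial value is passed via the init keyword rather than positionally:

```julia
# Hypothetical widening-on-demand sum: the caller picks the accumulator type.
widesum(::Type{T}, x) where {T} = reduce(+, x; init = zero(T))

widesum(Int64, Int8[100, 100, 100])  # accumulates in Int64, so no overflow
```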

The other choice is to rely on the implementation to choose the “right” kind of widening. There is no general rule that would satisfy every application, and it will either overshoot, lead to overflow anyway for some cases, or be type unstable if it tries something clever in between.

About 1., the promotion rule for reduce(+, x) is not the same as for plain +; it has some fairly complex rules for determining the result type. I wonder whether that should be made visible, and specialized further for certain operators. For example, while it makes sense for + or * to accumulate in a different type, for operators like &, |, and xor no widening is needed (or desired).
About 2., empty arrays currently throw an error; I think that’s more consistent than returning a NaN from mean only for those types that have a NaN.

The sum(f::Callable, a) method can already be used to ensure that the widening you want is performed (it’s equivalent to mapreduce(f, +, a)), so I’m less concerned about that, although a note in the documentation might be useful. My real concern is with different collections of the same values giving different results.

With mean, for integer values, it’s possible to always compute the exact sum before performing the final division, without incurring a huge performance cost (in fact, performance may be a good deal better than the current implementation that overflows). If people feel that a mean implementation that always produces a “correct” value for integers isn’t useful in Base, I suppose it could be provided in a package, along with implementations of things like sum that can calculate a sum without overflow more efficiently than simply using something like BigInt, just as Base already has the slower but more accurate sum_kbn for summing floating-point values.
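For concreteness, here is one hedged sketch of what such an exact integer mean might look like: accumulate Int64 values in Int128, which cannot overflow unless the collection has on the order of 2^64 elements. This is illustrative only, not a drop-in replacement for the Base implementation:

```julia
# Illustrative only: exact mean for Int64 values by summing in Int128.
function exact_mean(v::AbstractVector{Int64})
    isempty(v) && throw(ArgumentError("mean of an empty collection"))
    s = zero(Int128)
    for x in v
        s += x  # widening accumulation: Int128 + Int64 -> Int128
    end
    return s / length(v)  # integer / integer division yields Float64
end

exact_mean(fill(typemax(Int64), 4))  # the Int64 sum would overflow; this doesn't
```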

You probably meant (zero(T)+zero(T))/one(T) in your original post?

Isn’t it the responsibility of the author of the custom type to make sure that the operations are type stable, and wouldn’t it be better to let mean fail when this is not the case, instead of using this workaround?

Note I said “the type of”; indeed the initial value must be 0, and that’s one way to get it.

I’m not sure we want to force that addition preserves the original type (which BTW is different from the usual definition of type-stability in Julia). For example, Bool currently does not follow this rule in Base.
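For reference, Bool is an easy way to see that addition need not preserve the element type:

```julia
typeof(true + true)       # Int, not Bool: Bool addition promotes
sum([true, false, true])  # counts the trues, returning an Int
```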

2 posts were split to a new topic: Reductions along dimension for tuples