Inconsistent behavior of `sum`,`mean` (and probably others) on different collection types

Note that the input type is not preserved in general. For iterables, there is a fairly complex mechanism for determining the accumulator type: for integers, it widens smaller types to Int32, and Int32 or Int64 to Int64; for non-integers, the result type becomes typeof(x + zero(x)). This is not the case for tuples, which are simply summed without widening. I would assume that is because summing tuples is not a particularly meaningful operation, but there might be a more specific reason.
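The difference can be seen with a small sketch like the following (the exact result types depend on the Julia version and platform word size, and tuple summation may have changed since this thread):

```julia
# Sketch of the inconsistency described above; result types are
# version-dependent, so the comments hedge rather than assert.
a = Int8[100, 100]
t = (Int8(100), Int8(100))

sum(a)          # accumulates in a widened type, so 200 is representable
sum(t)          # summed with plain +, so Int8 overflow is possible here
typeof(sum(a))  # a wider integer type (e.g. Int)
typeof(sum(t))  # Int8 on the versions discussed in this thread
```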

At this point, we should really consider removing automatic widening from reductions over numeric types, except perhaps Bool, which I think should not be numeric anymore anyway. If someone wants a larger result type, it’s better that they specify it; otherwise, we should just give them what the normal promotion rules would give them. In particular, for collections of same-type elements we no longer promote for ordinary summation, and I think reductions should follow suit. That said, perhaps we should do the accumulation in the result type, so mean over any integer type would accumulate in Float64 because of the division by the integer length at the end.


Why not zero(T)/one(T)?


Just in case addition returns a different type. In practice it probably won’t make a difference because of the final division, but who knows what a custom type might want to do…

Unitful.jl numbers do not carry units when using zero, so zero(T)/one(T) would get the wrong type (wrong units) in that case. (one(T)+one(T))/one(T) would be unitless, so it would probably need a correction for units.

Before everyone gets caught up in this /one(T) business, let me remind you that the divisor is, and should be, an Int. That also eliminates the unit problem.

For most types, the return type is now that of (T + T) / Int, which I think is as good as it gets.
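Concretely, the (T + T) / Int rule predicts result types like the following (a sketch; mean moved to the Statistics stdlib in Julia 0.7, it lived in Base when this thread was written):

```julia
using Statistics  # provides mean on Julia ≥ 0.7

# What the (T + T) / Int rule predicts:
typeof(mean(Float32[1, 2, 3]))  # Float32: (Float32 + Float32) / Int
typeof(mean(Int16[1, 2, 3]))    # Float64: widened integer sum / Int
```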


Yes, as I’d said earlier, all of the implementations of mean use an Int for the division (either incrementing a counter in a loop for a general iterable or the return value from length or _length).

Unfortunately, using the type of (T + T) doesn’t give a useful result for smaller integer types, and how the values are accumulated is really an implementation detail that I feel should not necessarily affect the result of mean.
Also, as a user of the function, I’d rather see that the author took care to give a useful result in as many cases as possible, instead of having to worry about it myself. I’d also like a note in the documentation warning about any cases where the user might still need to worry about overflow, along with how to resolve the issue (for example, using mean(UInt128, vec) or mean(BigInt, vec) to prevent overflow).
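A mean(::Type, vec) method like the one suggested does not exist as such, but the existing two-argument mean(f, itr) can serve the same purpose, since a type used as a function acts as a constructor (a sketch):

```julia
using Statistics

v = fill(typemax(Int64), 4)

mean(v)          # the Int64 accumulation can overflow here
mean(big, v)     # widen each element to BigInt before accumulating
mean(Int128, v)  # the Int128 constructor is applied to each element
```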

The current implementation of mean for an iterable also has a problem: the accumulator is not type stable if the values are not all of the same type.
I think using eltype on the collection and then picking an accumulation type that avoids the instability would give both better results and better performance than using typeof(value + zero(value)).

Note: I’m just trying to discuss any potential gotchas in advance, before implementing a replacement for mean.
If people prefer, discussion of an improved mean implementation can be moved to a separate topic.

One rule which is consistent:

  1. For non-empty arrays, the type of mean(x) could be required to always match that of reduce(+, x)/length(x). This means that overflow etc. happens as it would with +.
  2. For empty arrays, mean throws an error (or returns a NaN if possible).
  3. Same applies to sum, except for empty arrays we get zero(eltype(x)), or if that is not defined, an error.
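The rule above can be written down directly; strict_mean and strict_sum below are hypothetical names for illustration, not existing functions:

```julia
# Hypothetical implementations of the consistency rule above.
strict_mean(x) =
    isempty(x) ? throw(ArgumentError("mean of an empty collection")) :
    reduce(+, x) / length(x)

strict_sum(x) = isempty(x) ? zero(eltype(x)) : reduce(+, x)

strict_sum(Int8[100, 100])  # overflows exactly as Int8(100) + Int8(100) does
```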

For integer arithmetic, the values should also be equal (ie overflow the same way). For floating point, they could differ because of floating point error.

This seems somewhat harsh, as you are likely to get nonsensical results with overflow. The advantage is consistency and type-stability. There could be a sum(T, x) signature that is equivalent to reduce(+, zero(T), x) for widening on demand.
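The widening-on-demand signature could look like the sketch below; widesum is a hypothetical name (sum(::Type, x) is not an existing method), and on Julia ≥ 1.0 the initial value is passed via the init keyword rather than positionally:

```julia
# Hypothetical widening-on-demand sum: the caller picks the accumulator type.
widesum(::Type{T}, x) where {T} = reduce(+, x; init = zero(T))

widesum(Int64, Int8[100, 100, 100])  # accumulates in Int64, so no overflow
```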

The other choice is to rely on the implementation to choose the “right” kind of widening. There is no general rule that would satisfy every application, and it will either overshoot, lead to overflow anyway for some cases, or be type unstable if it tries something clever in between.

About 1., the promotion rule for reduce(+, x) is not the same as for plain +; it has some fairly complex rules for determining the result type. I wonder whether that should be made visible, and specialized further for certain operators. For example, while it makes sense for + or * to accumulate in a different type, for operators like &, |, and xor no widening is needed (or desired).
About 2., empty arrays currently throw an error; I think that’s more consistent than returning a NaN from mean only for those types that have a NaN.

The sum(f::Callable, a) method can already be used to ensure that the widening you want is performed (it’s equivalent to mapreduce(f, +, a)), so I’m less concerned about that, although a note in the documentation might be useful. My real concern is with different collections of the same values giving different results.

With mean, for integer values, it’s possible to always compute the exact sum before performing the final division, without incurring a huge performance cost (in fact, performance may be a good deal better than the current implementation that overflows). If people feel that a mean implementation that always produces a “correct” value for integers isn’t useful in Base, I suppose it could be provided in a package, along with implementations of things like sum that can calculate a sum without overflow more efficiently than simply using something like BigInt, just as Base already has the slower but more accurate sum_kbn for summing floating-point values.
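For concreteness, here is one hedged sketch of what such an exact integer mean might look like: accumulate Int64 values in Int128, which cannot overflow unless the collection has on the order of 2^64 elements. This is illustrative only, not a drop-in replacement for the Base implementation:

```julia
# Illustrative only: exact mean for Int64 values by summing in Int128.
function exact_mean(v::AbstractVector{Int64})
    isempty(v) && throw(ArgumentError("mean of an empty collection"))
    s = zero(Int128)
    for x in v
        s += x  # widening accumulation: Int128 + Int64 -> Int128
    end
    return s / length(v)  # integer / integer division yields Float64
end

exact_mean(fill(typemax(Int64), 4))  # the Int64 sum would overflow; this doesn't
```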

You probably meant (zero(T)+zero(T))/one(T) in your original post?

Isn’t it the responsibility of the author of the custom type to make sure that the operations are type stable, and wouldn’t it be better to let mean fail when this is not the case, instead of using this workaround?

Note I said “the type of”; indeed the initial value must be 0, and that’s one way to get it.

I’m not sure we want to force that addition preserves the original type (which BTW is different from the usual definition of type-stability in Julia). For example, Bool currently does not follow this rule in Base.
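For reference, Bool is an easy way to see that addition need not preserve the element type:

```julia
typeof(true + true)       # Int, not Bool: Bool addition promotes
sum([true, false, true])  # counts the trues, returning an Int
```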

2 posts were split to a new topic: Reductions along dimension for tuples