julia> N = 100000000; aa = rand(Float32, N);
julia> mean((x for x in aa))
0.16777216f0
julia> mean(aa)
0.500059f0

I know what’s happening: Float32 has limited accuracy, and mean over an iterator computes a “naive” sum, which saturates at some point. However, it caught me by surprise in my real-world scenario:

julia> mean(skipmissing(Umat))
1.0638367f0 V
julia> mean(filter(!ismissing, Umat))
3.1320891f0 V

That’s a bit scary. Is that just Something I Must Watch Out For? It feels like there should be a reasonable algorithm that gives good accuracy without sacrificing speed. Something like

function sum(iter)
total = zero(eltype(iter))
minitotal = zero(eltype(iter))
i = 0
for x in iter
if (i+=1) % 100000 == 0
total += minitotal
minitotal = zero(eltype(iter))
end
minitotal += x
end
return total + minitotal
end

I’m not particularly good with numerical methods, but wouldn’t that be similar in efficiency, with much better accuracy?

Yes, this is a big topic. For example, see Kahan summation algorithm - Wikipedia . The Julia standard library has alternative summation implementation rules for this.

That’s a great example of the problem of sequential summation.

For arrays, Julia uses a hybrid sequential/pairwise approach for summation (using sequential for small chunks, then accumulating these chunks in a pairwise manner), which is why sum(aa) avoids this problem.

However there’s no reason why we couldn’t do the same thing for general iterators: could you open an issue?

Not sure about general iterables (the algorithm would have to be adapted since we can’t access items at arbitrary positions like for arrays), but for skipmissing we could definitely do that.

A direct implementation of pairwise summation requires random access, which is we we don’t use the pairwise algorithm for general iterators.

However, I suppose it might be possible to implement pairwise summation on a sequential iterator by allocating an explicit “stack” of O(log₂n) accumulators, and doing accumulation on the correct one as we go along.

The current implementation does access them in order, it just uses the call stack as the storage stack. However you do need to know the length ahead of time (so you know how deep the stack needs to be).

Not necessarily; you can just expand as necessary. Eg after 10 levels of depth are used up, add another 10. Someone just needs to code this, which can be tricky to do in a compiler-friendly way if the iterator has unknown eltype.