Imprecision of mean over iterators of large vectors

cstjean · December 17, 2018, 6:43pm

To wit:

julia> N = 100000000; aa = rand(Float32, N);

julia> mean((x for x in aa))
0.16777216f0

julia> mean(aa)
0.500059f0

I know what’s happening: Float32 has limited accuracy, and mean over an iterator computes a “naive” sum, which saturates at some point. However, it caught me by surprise in my real-world scenario:

julia> mean(skipmissing(Umat))
1.0638367f0 V

julia> mean(filter(!ismissing, Umat))
3.1320891f0 V

That’s a bit scary. Is that just Something I Must Watch Out For? It feels like there should be a reasonable algorithm that gives good accuracy without sacrificing speed. Something like

function sum(iter)
    total = zero(eltype(iter))
    minitotal = zero(eltype(iter))
    i = 0
    for x in iter
        if (i+=1) % 100000 == 0
            total += minitotal
            minitotal = zero(eltype(iter))
        end
        minitotal += x
    end
    return total + minitotal
end

I’m not particularly good with numerical methods, but wouldn’t that be similar in efficiency, with much better accuracy?

ChrisRackauckas · December 17, 2018, 7:02pm

Yes, this is a big topic. For example, see Kahan summation algorithm - Wikipedia . The Julia standard library has alternative summation implementation rules for this.

cstjean · December 17, 2018, 7:16pm

That’s the algorithm used by sum(::Vector), right? Why can’t it be used with an iterable/generator?

simonbyrne · December 17, 2018, 7:20pm

That’s a great example of the problem of sequential summation.

For arrays, Julia uses a hybrid sequential/pairwise approach for summation (using sequential for small chunks, then accumulating these chunks in a pairwise manner), which is why sum(aa) avoids this problem.

However there’s no reason why we couldn’t do the same thing for general iterators: could you open an issue?

nalimilan · December 17, 2018, 8:17pm

Not sure about general iterables (the algorithm would have to be adapted since we can’t access items at arbitrary positions like for arrays), but for skipmissing we could definitely do that.

stevengj · December 17, 2018, 8:42pm

A direct implementation of pairwise summation requires random access, which is we we don’t use the pairwise algorithm for general iterators.

However, I suppose it might be possible to implement pairwise summation on a sequential iterator by allocating an explicit “stack” of O(log₂n) accumulators, and doing accumulation on the correct one as we go along.

simonbyrne · December 17, 2018, 9:47pm

The current implementation does access them in order, it just uses the call stack as the storage stack. However you do need to know the length ahead of time (so you know how deep the stack needs to be).

Tamas_Papp · December 18, 2018, 7:48am

Not necessarily; you can just expand as necessary. Eg after 10 levels of depth are used up, add another 10. Someone just needs to code this, which can be tricky to do in a compiler-friendly way if the iterator has unknown eltype.

Topic		Replies	Views
Kahan summation in `sum`? General Usage float , precision	13	1286	December 29, 2024
Why is there a difference when you Sum over an Array versus a Generator? New to Julia question	5	475	October 3, 2021
Julia equivalent of Python's "fsum" for floating point summation General Usage python	54	7299	August 16, 2019
Speed issue with KahanSummation Performance question , package , parallel	12	1245	February 5, 2022
"sum([1.0, 1e100, 1.0, -1e100])" fails General Usage question , numbers	13	1512	February 16, 2022

Imprecision of mean over iterators of large vectors

Related topics