Array ordering and naive summation



One of the first things you learn in numerical analysis is that floating-point operations are not associative. A classic example is this:

julia> (0.1 + 0.2) + 0.3

julia> 0.1 + (0.2 + 0.3)

I was thinking of ways to make floating-point summation independent of the order of the summands without making the performance much worse (this is a hobby of mine). Julia currently uses a pairwise summation algorithm, which is much better than naive left-to-right reduction while having very nearly the same speed. There is also a sum_kbn function, which does "Kahan summation, aka compensated summation, which is slower but even more accurate. Yet neither of these is completely order-independent. This got me to wondering about how many different sums you can get by reordering a set of floating-point numbers. Which ultimately led me to this little function:

function sumsto(x::Float64)
    0 <= x < exp2(970) || throw(ArgumentError("sum must be in [0,2^970)"))
    n, p₀ = Base.decompose(x) # integers such that `n*exp2(p₀) == x`
    [realmax(); [exp2(p) for p in -1074:969 if iseven(n >> (p-p₀))]
    -realmax(); [exp2(p) for p in -1074:969 if isodd(n >> (p-p₀))]]

The sumsto function always returns the same 2046 floating-point numbers but returns them in a different order based on x: for any 64-bit float value of x from 0 up to (but not including) 2^970, the naive left-to-right sum of the vector returned by sumsto(x) is precisely x:

julia> foldl(+, sumsto(0.0))

julia> foldl(+, sumsto(eps(0.0)))

julia> foldl(+, sumsto(1.23))

julia> foldl(+, sumsto(e+0))

julia> foldl(+, sumsto(pi+0))

julia> foldl(+, sumsto(6.0221409e23))

julia> foldl(+, sumsto(9.979201547673598e291))

To test this a little more thoroughly, we can use this helper function to generate good random values of x:

randx() = reinterpret(Float64, reinterpret(UInt64, realmax()/2^54) & rand(UInt64))

julia> [randx() for _=1:10]
10-element Array{Float64,1}:

As you can see, randx() produces positive floating-point values with widely varying exponents and random significand bits. With this we can test sumsto a bit harder:

julia> x = randx()

julia> v = sumsto(x)
2046-element Array{Float64,1}:

julia> foldl(+, v)

If we test sumsto on a thousand different randx() values, it passes:

julia> all(1:1000) do _
           x = randx()
           foldl(+, sumsto(x)) == x

Returning to the original question of how many different sums you can get by reordering summands in naive floating-point summation, we see that the answer is “all of them”: summing just 2046 values in different orders can produce nearly any value!


This is beautiful! :heart: and deeply confusing :confused: :stuck_out_tongue:

How do the constants 64-bit, x = 970 and y = 2046 relate to each other? If you had y = 1023, how much smaller does x get/is it linear/etc?


The code takes x and decomposes it into a sum of powers of two, i.e. some subset of exp2(p) for -1074 ≤ p < 970. It separates those powers of two into the ones you want to keep and the ones you want to discard. The value exp2(970) is the first power of two that alters realmax() by addition:

julia> realmax() + exp2(970)

julia> realmax() + prevfloat(exp2(970))

julia> ans == realmax()

When adding the values from left to right, all the powers of two after realmax() but before -realmax() have no effect on the sum. Then -realmax() cancels realmax() out, bringing the sum back to zero so the remaining powers of two are added up as expected, giving the final result, x. There are 2044 powers of two that can be represented as 64-bit floating-point value below 2^970 since the smallest representable power of 2 is eps(0.0) == exp2(-1074) == 5.0e-324. So the number of values we’re summing is 1074 + 970 + 2, one for each power of two and two more for realmax() and -realmax().

I’m also interested in how bad this can be with only non-negative values, since an obvious way to avoid this kind of trick is to keep separate accumulators for positive and negative values.


With non-negative values, summation is well-conditioned (the condition number is 1 in the L1 norm), so I think there are much stronger constraints on the magnitude of the forward error.


Right. I think the worst you can do is to see the largest value first and then get a series of values that are just too small to have an effect on that value. Specifically, something like this:

[prevfloat(exp2(54+p)), prevfloat(exp2(p)), ... [2^54 times], prevfloat(exp2(p))]

The correct sum of this series of values is about 2x what you’ll get. So 100% error for 18 trillion values. Keeping separate positive and negative accumulators seems to be a good thing, but it’s hard to say if it’s worth the additional work.