Array ordering and naive summation

StefanKarpinski · February 5, 2017, 9:13pm

One of the first things you learn in numerical analysis is that floating-point operations are not associative. A classic example is this:

julia> (0.1 + 0.2) + 0.3
0.6000000000000001

julia> 0.1 + (0.2 + 0.3)
0.6

I was thinking of ways to make floating-point summation independent of the order of the summands without making the performance much worse (this is a hobby of mine). Julia currently uses a pairwise summation algorithm, which is much better than naive left-to-right reduction while having very nearly the same speed. There is also a sum_kbn function, which does "Kahan summation, aka compensated summation, which is slower but even more accurate. Yet neither of these is completely order-independent. This got me to wondering about how many different sums you can get by reordering a set of floating-point numbers. Which ultimately led me to this little function:

function sumsto(x::Float64)
    0 <= x < exp2(970) || throw(ArgumentError("sum must be in [0,2^970)"))
    n, p₀ = Base.decompose(x) # integers such that `n*exp2(p₀) == x`
    [realmax(); [exp2(p) for p in -1074:969 if iseven(n >> (p-p₀))]
    -realmax(); [exp2(p) for p in -1074:969 if isodd(n >> (p-p₀))]]
end

The sumsto function always returns the same 2046 floating-point numbers but returns them in a different order based on x: for any 64-bit float value of x from 0 up to (but not including) 2^970, the naive left-to-right sum of the vector returned by sumsto(x) is precisely x:

julia> foldl(+, sumsto(0.0))
0.0

julia> foldl(+, sumsto(eps(0.0)))
5.0e-324

julia> foldl(+, sumsto(1.23))
1.23

julia> foldl(+, sumsto(e+0))
2.718281828459045

julia> foldl(+, sumsto(pi+0))
3.141592653589793

julia> foldl(+, sumsto(6.0221409e23))
6.0221409e23

julia> foldl(+, sumsto(9.979201547673598e291))
9.979201547673598e291

To test this a little more thoroughly, we can use this helper function to generate good random values of x:

randx() = reinterpret(Float64, reinterpret(UInt64, realmax()/2^54) & rand(UInt64))

julia> [randx() for _=1:10]
10-element Array{Float64,1}:
 3.05224e212
 2.34736e176
 9.67114e-97
 1.29412e193
 3.9435e231
 2.39664e-248
 3.18458e-171
 4.83044e-270
 1.15809e-38
 9.18716e291

As you can see, randx() produces positive floating-point values with widely varying exponents and random significand bits. With this we can test sumsto a bit harder:

julia> x = randx()
5.769336376990973e214

julia> v = sumsto(x)
2046-element Array{Float64,1}:
 1.79769e308
 4.94066e-324
 9.88131e-324
 ⋮
 2.69319e213
 1.07728e214
 4.3091e214

julia> foldl(+, v)
5.769336376990973e214

If we test sumsto on a thousand different randx() values, it passes:

julia> all(1:1000) do _
           x = randx()
           foldl(+, sumsto(x)) == x
       end
true

Returning to the original question of how many different sums you can get by reordering summands in naive floating-point summation, we see that the answer is “all of them”: summing just 2046 values in different orders can produce nearly any value!

felix · February 6, 2017, 8:26am

This is beautiful! and deeply confusing

How do the constants 64-bit, x = 970 and y = 2046 relate to each other? If you had y = 1023, how much smaller does x get/is it linear/etc?

StefanKarpinski · February 6, 2017, 6:38pm

The code takes x and decomposes it into a sum of powers of two, i.e. some subset of exp2(p) for -1074 ≤ p < 970. It separates those powers of two into the ones you want to keep and the ones you want to discard. The value exp2(970) is the first power of two that alters realmax() by addition:

julia> realmax() + exp2(970)
Inf

julia> realmax() + prevfloat(exp2(970))
1.7976931348623157e308

julia> ans == realmax()
true

When adding the values from left to right, all the powers of two after realmax() but before -realmax() have no effect on the sum. Then -realmax() cancels realmax() out, bringing the sum back to zero so the remaining powers of two are added up as expected, giving the final result, x. There are 2044 powers of two that can be represented as 64-bit floating-point value below 2^970 since the smallest representable power of 2 is eps(0.0) == exp2(-1074) == 5.0e-324. So the number of values we’re summing is 1074 + 970 + 2, one for each power of two and two more for realmax() and -realmax().

I’m also interested in how bad this can be with only non-negative values, since an obvious way to avoid this kind of trick is to keep separate accumulators for positive and negative values.

stevengj · February 6, 2017, 7:22pm

With non-negative values, summation is well-conditioned (the condition number is 1 in the L1 norm), so I think there are much stronger constraints on the magnitude of the forward error.

StefanKarpinski · February 6, 2017, 7:32pm

Right. I think the worst you can do is to see the largest value first and then get a series of values that are just too small to have an effect on that value. Specifically, something like this:

[prevfloat(exp2(54+p)), prevfloat(exp2(p)), ... [2^54 times], prevfloat(exp2(p))]

The correct sum of this series of values is about 2x what you’ll get. So 100% error for 18 trillion values. Keeping separate positive and negative accumulators seems to be a good thing, but it’s hard to say if it’s worth the additional work.

Sukera · May 13, 2023, 6:00pm

For posterity (and because this is one of my favorite floating point shenanigans posts), the realmax function is floatmax now. So the patched code is

function sumsto(x::Float64)
    0 <= x < exp2(970) || throw(ArgumentError("sum must be in [0,2^970)"))
    n, p₀ = Base.decompose(x) # integers such that `n*exp2(p₀) == x`
    [floatmax(); [exp2(p) for p in -1074:969 if iseven(n >> (p-p₀))]
    -floatmax(); [exp2(p) for p in -1074:969 if isodd(n >> (p-p₀))]]
end