# Array ordering and naive summation

One of the first things you learn in numerical analysis is that floating-point operations are not associative. A classic example is this:

``````
julia> (0.1 + 0.2) + 0.3
0.6000000000000001

julia> 0.1 + (0.2 + 0.3)
0.6
``````
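The same thing happens when reducing an array: a naive left-to-right `foldl` gives different results for different orderings of the same values. A tiny sketch using the values above:

```julia
v = [0.1, 0.2, 0.3]
foldl(+, v)           # 0.6000000000000001, i.e. (0.1 + 0.2) + 0.3
foldl(+, reverse(v))  # 0.6, i.e. (0.3 + 0.2) + 0.1
```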

I was thinking of ways to make floating-point summation independent of the order of the summands without making the performance much worse (this is a hobby of mine). Julia currently uses a pairwise summation algorithm, which is much better than naive left-to-right reduction while having very nearly the same speed. There is also a `sum_kbn` function, which does Kahan summation (a.k.a. compensated summation), which is slower but even more accurate. Yet neither of these is completely order-independent. This got me wondering how many different sums you can get by reordering a set of floating-point numbers. Which ultimately led me to this little function:

``````
function sumsto(x::Float64)
    0 <= x < exp2(970) || throw(ArgumentError("sum must be in [0,2^970)"))
    n, p₀ = Base.decompose(x) # integers such that `n*exp2(p₀) == x`
    [realmax(); [exp2(p) for p in -1074:969 if iseven(n >> (p-p₀))];
     -realmax(); [exp2(p) for p in -1074:969 if isodd(n >> (p-p₀))]]
end
``````

The `sumsto` function always returns the same 2046 floating-point numbers, just in a different order depending on `x`: for any `Float64` value `x` from 0 up to (but not including) 2^970, the naive left-to-right sum of the vector returned by `sumsto(x)` is precisely `x`:

``````
julia> foldl(+, sumsto(0.0))
0.0

julia> foldl(+, sumsto(eps(0.0)))
5.0e-324

julia> foldl(+, sumsto(1.23))
1.23

julia> foldl(+, sumsto(e+0))
2.718281828459045

julia> foldl(+, sumsto(pi+0))
3.141592653589793

julia> foldl(+, sumsto(6.0221409e23))
6.0221409e23

julia> foldl(+, sumsto(9.979201547673598e291))
9.979201547673598e291
``````
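To emphasize that only the ordering differs, note that sorting the output erases the dependence on `x`: the same 2046 values come back every time. A quick check (restating `sumsto` with `floatmax`, the modern name for `realmax`, so the snippet is self-contained):

```julia
function sumsto(x::Float64)
    0 <= x < exp2(970) || throw(ArgumentError("sum must be in [0,2^970)"))
    n, p₀ = Base.decompose(x) # integers such that `n*exp2(p₀) == x`
    [floatmax(); [exp2(p) for p in -1074:969 if iseven(n >> (p-p₀))];
     -floatmax(); [exp2(p) for p in -1074:969 if isodd(n >> (p-p₀))]]
end

a, b = sumsto(1.0), sumsto(2.0)
sort(a) == sort(b)          # true: identical multisets of 2046 values
(foldl(+, a), foldl(+, b))  # (1.0, 2.0): only the order differs
```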

To test this a little more thoroughly, we can use this helper function to generate good random values of `x`:

``````
randx() = reinterpret(Float64, reinterpret(UInt64, realmax()/2^54) & rand(UInt64))

julia> [randx() for _=1:10]
10-element Array{Float64,1}:
3.05224e212
2.34736e176
9.67114e-97
1.29412e193
3.9435e231
2.39664e-248
3.18458e-171
4.83044e-270
1.15809e-38
9.18716e291
``````

As you can see, `randx()` produces positive floating-point values with widely varying exponents and random significand bits. With this we can test `sumsto` a bit harder:

``````
julia> x = randx()
5.769336376990973e214

julia> v = sumsto(x)
2046-element Array{Float64,1}:
1.79769e308
4.94066e-324
9.88131e-324
⋮
2.69319e213
1.07728e214
4.3091e214

julia> foldl(+, v)
5.769336376990973e214
``````

If we test `sumsto` on a thousand different `randx()` values, it passes:

``````
julia> all(1:1000) do _
           x = randx()
           foldl(+, sumsto(x)) == x
       end
true
``````

Returning to the original question of how many different sums you can get by reordering summands in naive floating-point summation, we see that the answer is "all of them": summing just 2046 values in different orders can produce nearly any value!


This is beautiful! And deeply confusing.

How do the constants (64-bit, x = 970, and y = 2046) relate to each other? If you had y = 1023, how much smaller does x get? Is it linear?

The code takes `x` and decomposes it into a sum of powers of two, i.e. some subset of `exp2(p)` for `-1074 ≤ p < 970`. It separates those powers of two into the ones you want to keep and the ones you want to discard. The value `exp2(970)` is the first power of two that alters `realmax()` by addition:

``````
julia> realmax() + exp2(970)
Inf

julia> realmax() + prevfloat(exp2(970))
1.7976931348623157e308

julia> ans == realmax()
true
``````

When adding the values from left to right, all the powers of two after `realmax()` but before `-realmax()` have no effect on the sum. Then `-realmax()` cancels `realmax()` out, bringing the sum back to zero, so the remaining powers of two are added up as expected, giving the final result, `x`. There are 2044 powers of two representable as 64-bit floating-point values below 2^970, since the smallest representable power of two is `eps(0.0) == exp2(-1074) == 5.0e-324`. So the number of values we're summing is 1074 + 970 + 2 = 2046: one for each power of two, plus two more for `realmax()` and `-realmax()`.
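The absorb-then-cancel mechanics can be seen in miniature (using `floatmax`, the modern name for `realmax`):

```julia
tiny = exp2(-1074)               # smallest positive Float64, eps(0.0)
floatmax() + tiny == floatmax()  # true: tiny is absorbed without a trace
foldl(+, [floatmax(), tiny, -floatmax(), tiny])
# 5.0e-324: the first tiny is absorbed, the cancellation brings the
# running sum back to zero, and only the second tiny survives
```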

I'm also interested in how bad this can be with only non-negative values, since an obvious way to avoid this kind of trick is to keep separate accumulators for positive and negative values.

With non-negative values, summation is well-conditioned (the condition number is 1 in the L1 norm), so I think there are much stronger constraints on the magnitude of the forward error.
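A rough empirical illustration of that conditioning (hypothetical data; the standard bound puts the relative error of naive summation of `n` nonnegative values at roughly `n*eps()/2` in the worst case):

```julia
v = abs.(randn(10_000))     # nonnegative values
s_fwd = foldl(+, v)
s_rev = foldl(+, reverse(v))
abs(s_fwd - s_rev) / s_fwd  # tiny: on the order of n*eps() at worst
```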


Right. I think the worst you can do is to see the largest value first and then get a series of values that are just too small to have an effect on that value. Specifically, something like this:

``````
[prevfloat(exp2(54+p)), prevfloat(exp2(p)), ... [2^54 times], prevfloat(exp2(p))]
``````

The correct sum of this series of values is about twice what you'll get. So nearly 100% error for about 2^54 ≈ 18 quadrillion values. Keeping separate positive and negative accumulators seems to be a good thing, but it's hard to say if it's worth the additional work.
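A minimal sketch of that worst case: a value just below `2^54` followed by values just below `1`, each of which falls under half an ulp of the running total and is therefore rounded away:

```julia
big   = prevfloat(exp2(54))  # ≈ 1.8e16, with eps(big) == 2.0
small = prevfloat(1.0)       # just under half an ulp of big
big + small == big           # true: small is rounded away entirely
```

Add `2^54` such `small` values after `big` and the true sum is roughly `2*big`, while the naive left-to-right sum never moves off `big`.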

For posterity (and because this is one of my favorite floating-point shenanigans posts): the `realmax` function has since been renamed to `floatmax`. So the patched code is

``````
function sumsto(x::Float64)
    0 <= x < exp2(970) || throw(ArgumentError("sum must be in [0,2^970)"))
    n, p₀ = Base.decompose(x) # integers such that `n*exp2(p₀) == x`
    [floatmax(); [exp2(p) for p in -1074:969 if iseven(n >> (p-p₀))];
     -floatmax(); [exp2(p) for p in -1074:969 if isodd(n >> (p-p₀))]]
end
``````