Accurate summation algorithm

THere is a whole zoo of generalisations of and alternatives to Kahans algorithm.

Oh hi, I’m your friendly compiler update. I can now happily unroll loops and inline functions.

# ...
sum1 = sum1 + 3465/(real(1,dp)**2-1/16._dp) + 3465/((1+0.5_dp)**2-1/16._dp)
sum1 = sum1 + 3465/(real(2,dp)**2-1/16._dp) + 3465/((2+0.5_dp)**2-1/16._dp)
sum1 = sum1 + 3465/(real(3,dp)**2-1/16._dp) + 3465/((3+0.5_dp)**2-1/16._dp)
# ...
sum1 = sum1 + 3465/(real(600,dp)**2-1/16._dp) + 3465/((600+0.5_dp)**2-1/16._dp)
do while (t > 1.8187e-12_dp)
    sum2 = sum2 + t
    k = 600 + 1; t = 3465/(real(k,dp)**2-1/16._dp) + 3465/((k+0.5_dp)**2-1/16._dp)
end do
print *, sum1 + sum2 

Oh hey, you know, I can figure out the bounds on that do while guy, too. Let’s unroll that one, too, for good measure. And, hey, did you say I can perform any kind of mathematics I want with -ffast-math? I know all these answers! You’re just adding everything together in any case… I can do that for you! Let’s just re-associate things to make it a bit easier on myself. I’ll just add things up sequentially instead. That’s faster!

4 Likes

You do know that both of these algorithms are already in the Julia standard library, right?

sum_kbn is Kahan summation, and sum uses pairwise summation for any sufficiently large array.

And there is a well known paper by Higham analyzing these and other summation variants (e.g. sorting before summing). (But the analysis depends on the order of the operations, which a “fast-math” compiler might alter, of course.)

1 Like

Also, if you need the exact answer, then what you’re looking for is called a superaccumulator.

1 Like

Thanks for the reference. I didn’t know about these methods

A point I’ve made several times previously is the following:

  • Mathematically, since + is associative and commutative, all correct summation algorithms are algebraically equivalent to each other, in particular they are all equivalent to left-to-right summation.

  • The -ffast-math option gives the compiler license to transform any algorithm into any algebraically equivalent one in the mathematical sense.

  • Therefore -ffast-math gives the compiler license to transform any summation algorithm into any other summation algorithm, including left-to-right summation.

That’s why there’s no fundamentally useful answer to this question: the only thing you can find out is how a particular compiler happens to optimize your code. A different compiler in -ffast-math mode or a different version of your own compiler – or the exact same compiler targeting a different architecture – can always make the situation different and potentially much worse. The example above happens to outsmart one particular compiler, but as @mbauman’s post shows, a slightly smarter compiler could straightforwardly overcome that obstacle and make this exact same code fail. Or put another way, there are no reliable summation algorithms in -ffast-math mode.

The same argument applies to superaccumulators in principle, but they typically operate on floating-point values reinterpreted as integers which -ffast-math gives no additional license to optimize, so they should be safe enough. They’re much slower than approximate summation algorithms, however.

The Higham summation paper:

2 Likes

I found much of the discussion fascinating.

While there have already been umpteen examples, StefanKarpinski said:

In the particular case of openlibm, writing correct algorithms without the guarantees of IEEE will end up being be less efficient than just turning IEEE mode arithmetic on. If you are in the business of writing mathematical code that gives consistent, predictable, correct answers, IEEE is not the enemy of performance, it is precisely the set of guarantees you need in order to know that you are doing exactly the minimal amount of work needed to get the right answer.

Thought I’d add I recently saw a post somewhere using the trapezoid rule for the integral of sin(x) from 0 to pi. The correct answer is of course -cos(pi) - (-cos(0) ) = 2.
Using 10,000,000 evaluations so it takes a while, a Fortran implementation using Kahan summation:

program integral
implicit none

integer, parameter                  :: n_lim = 10000000
real, parameter                     ::  a = acos( - 1.0 ), &
                                        dx = (a - 0.0) / n_lim
real                                :: t, sinx, r, cr, yr, tr, cs, ys, ts
integer                             :: n

sinx = 0.5 * ( sin(0.0) + sin(a) )
r = dx; cs = 0.0; cr = 0.0
do n = 2, n_lim
    ys = sin(r) - cs
    ts = sinx + ys
    cs = (ts - sinx) - ys
    sinx = ts

    yr = dx - cr
    tr = r + yr
    cr = (tr - r) - yr
    r = tr
end do

sinx = sinx * dx

write(*,*) sinx

end program integral

The results, first compiling with IEEE, then -ffast-math, and finally -ffast-math and double precision:

$ gfortran-7.2 -O3 -funroll-loops sin_integral.f08 -o int
$ time ./int   2.00000000    

real	0m0.073s
user	0m0.069s
sys	0m0.004s
$ gfortran-7.2 -Ofast sin_integral.f08 -o int_fastmath
$ time ./int_fastmath
   2.10740471    

real	0m0.056s
user	0m0.056s
sys	0m0.000s
$ gfortran-7.2 -Ofast -fdefault-real-8 sin_integral.f08 -o int_fastmath_double
$ time ./int_fastmath_double
   2.0000000004135496     

real	0m0.147s
user	0m0.147s
sys	0m0.000s

The third answer is close enough – but still more error than IEEE, and took over twice as long!
Seems like a clear example of “IEEE is…precisely the set of guarantees you need in order to know that you are doing exactly the minimal amount of work needed to get the right answer.”

The sin(x) from 0 to pi isn’t nearly as badly behaved as many of the example summations, but still features a lot of numbers closer to zero, and numerical integration is a fairly common application.

Also, for fun, running on Julia 0.7:

julia> function integral(f, l, u, n = 10_000_000)
           dx = (u - l) / n
           T = eltype(dx)
           dx * (sum(f, linspace(dx, u - dx, n-1)) + (f(T(l)) + f(T(u)))/2)
       end

julia> integral(sin, 0f0, π)
1.9999998f0

julia> integral(sin, 0, π)
1.9999997999999437

julia> @benchmark integral(sin, 0f0, π)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  3
  --------------
  minimum time:     51.116 ms (0.00% GC)
  median time:      51.818 ms (0.00% GC)
  mean time:        52.330 ms (0.00% GC)
  maximum time:     58.548 ms (0.00% GC)
  --------------
  samples:          96
  evals/sample:     1

julia> @benchmark integral(sin, 0, π)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     80.174 ms (0.00% GC)
  median time:      80.943 ms (0.00% GC)
  mean time:        81.045 ms (0.00% GC)
  maximum time:     83.025 ms (0.00% GC)
  --------------
  samples:          62
  evals/sample:     1

I’m not exactly sure why it allocates when calling with a Float32 argument. Unfortunately, sum_kbn doesn’t have a method that takes a function and an iterable as arguments.
But simply using the Base sum and linspace functions give great performance and fairly solid accuracy.

To be a little more fair (to be totally fair, would probably have to time them in the same way – maybe there is a 20 ms overhead on time and launching the executable?), implementing the exact same algorithm:

julia> function integral_kbn2(f, l, u, n = 10_000_000)
           dx = (u - l) / n
           T = eltype(dx)
           sinx = (f(T(l)) + f(T(u))) / 2
           r = dx; cs = zero(T); cr = zero(T)
           for n ∈ 2:n_lim
               ys = f(r) - cs
               ts = sinx + ys
               cs = (ts - sinx) - ys
               sinx = ts
           
               yr = dx - cr
               tr = r + yr
               cr = (tr - r) - yr
               r = tr
           end
           sinx * dx
       end
integral_kbn2 (generic function with 2 methods)

julia> integral_kbn2(sin, 0f0, π)
2.0f0

julia> integral_kbn2(sin, 0, π)
1.9999999999999836

julia> @benchmark integral_kbn2(sin, 0f0, π)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     54.491 ms (0.00% GC)
  median time:      55.370 ms (0.00% GC)
  mean time:        55.433 ms (0.00% GC)
  maximum time:     59.978 ms (0.00% GC)
  --------------
  samples:          91
  evals/sample:     1

julia> @benchmark integral_kbn2(sin, 0, π)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     128.128 ms (0.00% GC)
  median time:      129.914 ms (0.00% GC)
  mean time:        131.293 ms (0.00% GC)
  maximum time:     138.969 ms (0.00% GC)
  --------------
  samples:          39
  evals/sample:     1

As a disclaimer, I know Julia much better than Fortran, so my Fortran implementation and/or execution may be relatively lacking.

4 Likes

I wrote down the exact (consistent and coherent) rules for the Fortran school: https://gist.github.com/certik/89dcc5bccf041ce1f57f1533c6a5d9e0#rules-how-to-write-correct-code-in-fortran-school, and I have taken every single example that anybody ever posted on either this thread or the github thread, and I analyzed it from both the IEEE and Fortran perspective, and shown which of the rules it breaks and then how to rewrite it so that it follows all the Fortran school rules and then it recovers the IEEE accuracy as expected.

My document has all the details for each example, and also discusses the rules and philosophy at length, so I don’t need to copy it here. I will just point out a few things:

  • @mbauman and @StefanKarpinski 's argument is that the compiler can fuse the loops and thus break the conditioning and the result will lose accuracy. That is a correct argument assuming the compiler can fuse the loops, but that is against the rule 2. The Fortran rules only allow the compiler to rewrite things locally, not globally.

  • @Elrod see the notes for the analysis of your problem. I provided a solution that delivers an equivalent accuracy with -ffast-math.

  • The overall idea is that when you write an algorithm, the compiler can assume that things are locally well conditioned, and thus can rewrite things at will. The compiler can’t assume globally well conditioned code, since that would essentially mean that the code is not imperative, but functional like SymPy or Mathematica, but that only works for infinite precision. For finite precision we need to tell the compiler what the global order of operations is, since global reordering can change conditioning. Local reordering however does not change the conditioning, assuming the expressions are already well conditioned. The document has numerous examples of this particular point.
    To put it another way: if a code is well conditioned, as written, then local transformations can’t break the conditioning, but global can. The compiler is not allowed to do transformations that break the conditioning, and thus only local transformations are allowed (see the document for a precise formulation, e.g., the compiler can’t introduce variables that it doesn’t know if they are well conditioned…).

  • @StefanKarpinski’s point was also that there cannot be any set of consistent and coherent rules for the Fortran school. Stefan, please show how these rules are not consistent by finding an example which follows the rules as I wrote them, but still breaks with -ffast-math.

  • The final point is what about real compilers, do they actually follow these rules? Note that this is a separate question to whether the rules are consistent. The answer is that today’s compilers indeed seem to follow these rules, and thus it is safe and accurate to use -ffast-math, as long as the user’s code follows the rules. The main rule not to be broken is the global transformation/reordering — if any compiler decided to break that in the future, then hopefully they would provide an option to turn the global transformations off.

  • I think the main difference between let’s say the Kahan summation and the Fortran school of doing things is that the Kahan summation works for a wide range of pretty badly conditioned input (although it still breaks for very badly conditioned input, as shown in the github thread). It might be that the Kahan summation is the best that you can do, without knowing anything about the input. The Fortran school works in a way that you have to know what problem you are solving and what the input is and then arrange the algorithm so that it works for the given application.

The ball is now in your court to find an example that follows the Fortran rules and breaks (=loses accuracy). Or an example which you think cannot be rewritten to follow the Fortran rules and still solve the same problem. Or any other example which you think will show that the Fortran school rules break down on.

P.S. I wrote down these rules before @mbauman, @StefanKarpinski and @Elrod commented in this thread, but I have been consulting the rules with multiple people (including @mbauman) before posting here, so that I do not waste your time if something is trivially wrong with them.

I don’t think it’s particularly relevant to make up your own rules about Fortran compilers being allowed to do “well conditioned” transformations that are “local” but “don’t introduce new variables”. Your rules aren’t part of the specification of any language or compiler, so they are irrelevant when discussing what something like -ffast-math may or may not do (including in future compiler versions). Feel free to implement a new -fcertik-math pass for LLVM, though, with a formal specification of what it can do, and then we can examine whether it is something Julia should enable!

For amusement, though, you might consider whether your “Fortran rules” allow the compiler to transform c/(b+sqrt(b^2-c)) into b-sqrt(b^2-c) to save a division? The two expressions are algebraically equivalent in exact arithmetic, and don’t involve “introducing other variables” (whatever that means). But one of these two expressions is vastly more accurate than the other when c << b^2.

By the way, conditioning (condition numbers) in numerical analysis is a property of mathematical functions for given inputs, not of algorithms/code — the whole point of condition numbers is to separate analysis of the function being computed from analysis of particular floating-point algorithms. So, the condition numbers of c/(b+sqrt(b^2-c)) and b-sqrt(b^2-c) are identical, for example. Note also that an algorithm can be numerically unstable even if the function being computed is well-conditioned, and an unstable algorithm is often algebraically equivalent to a stable algorithm (e.g. classical vs. modified Gram–Schmidt).

10 Likes

@stevengj good point about the terminology, I shouldn’t be so loose with it.

Conditioning (Condition number - Wikipedia) is a property of the problem, not the numerical method. I think what I meant by “conditioning” in the notes is actually called numerical stability, perhaps backward stability. It seems you need both well conditioning and stability to get correct answers.

For c << b^2, the expression b-sqrt(b^2-c) breaks the rule number 1, which explicitly prohibits subtracting two close numbers if the loss of accuracy is more than what is needed for the given application. So I assume using the correct terminology, it means the expression is not numerically stable. Please correct me if I am wrong.

On the other hand, the expression c/(b+sqrt(b^2-c)) for c << b^2 does not seem to break any rule. So it is numerically stable.

You are correct that for c << b^2 they are both mathematically equal to c/(2*b), and all three expressions have the same condition number, and it seems it is well conditioned say for c=1e-10 and b = 1e3.

Note that generally for such c and b, an iterative method like the Newton method is generally more robust than using these formulas in my experience. But we are talking about the principle here, not a particular application.

Another common problem with quadratic equations is if b^2 ~ c, see e.g. Table 2 in https://people.eecs.berkeley.edu/~wkahan/Qdrtcs.pdf for an example. There the system itself seems badly conditioned, so no matter what numerical method I use, I’ll have problems with getting the correct answer. I think an iterative method will not help either.

The Gram-Schmidt example is similar I think.

Essentially your main objection is how can the compiler know not to change x to (x+a)-a. This is answered in the notes, it can’t introduce a new variable “a”, that can make the expression “ill conditioned”, and I should have used the term “unstable” I think, since the conditioning of x and a+x-a is the same.

You are right this is not about introducing new variables. Take x and (x+0x1.8p52) - 0x1.8p52. We didn’t introduce any new variables, but the second expression rounds x to the nearest integer in IEEE double precision arithmetics. Both expressions have the same conditioning, but the first expression is stable, while the second expression is unstable (since in this case the correct answer we want is x, but we got round(x) instead).

Finally, to transform the stable expression c/(b+sqrt(b^2-c)) into the unstable expression b-sqrt(b^2-c), one must multiply both nominator and denominator by b-sqrt(b^2-c), but that factor suffers from numerical cancellation, so this is against the rule number 1. Essentially, when the compiler does the transformation, it can’t introduce something that screws up stability. Here is a list of transformations that gcc does: FloatingPointMath - GCC Wiki, and none of them, as far as I know, converts a stable expression into an unstable one.

Here is another good page discussing these issues with several “counter examples”: CS267: Supplementary Notes on Floating Point. There the example 1 breaks the rule number 4, and the fix is to test using eps for x close to 0, and then use rational function approximation to get accurate answers on the interval (-eps, +eps). The example 2 breaks the rule number 2 (numerical cancellation), but I would need to see a particular application to see how to fix it. Example 3 breaks the rule number 4 (if the given application requires such values of ‘x’ and ‘y’ so that the argument balances on the edge of the domain/singularity of arccos, then one must use eps to test for that). The example 5 initially breaks the rule number 4, but then it is fixed according to the rule.

Anyway, thanks for the feedback @stevengj, I will incorporate it into the notes, to make it clearer. If you have another example, please let me know, or if I made a mistake in the analysis, please let me know too.

It seems the rule is: as a developer, I solve a well conditioned problem. The particular algorithm that I implemented is stable, and the compiler cannot take a stable expression and turn it into an unstable expression.

Your “rule 1” is very general and not well-defined. Whenever an algorithm misbehaves because of floating point issues, one can always find a violation of your “rule 1”. Because of this, I am not sure it is very useful in practice; one can translate it as “use appropriate algorithms”, but no one ever uses inappropriate ones on purpose.

Unfortunately, I don’t think that the numerical issues of floating point in practice are simple enough to be captured in a few simple “do this, don’t do that” style rules.

1 Like

@Tamas_Papp I think it can be well defined. Using J. Demmel’s notes that I posted above, the error of the algorithm obeys:

    error =  alg(x) - f(x) ~ f(x+e) - f(x) ~ f(x) + e*f'(x) - f(x) = e*f'(x)

Where f'(x) is conditioning, which is assumed to be well conditioned, and e is the backward error, which is the thing that can misbehave, but as a developer you write the code in away so that e is small.

The question in this discussion is can the compiler optimization take e and make it big? And it is easy to analyze a particular transformation. The rule is that it must be stable. So take

f(x) = x
alg(x) = (x+0x1.8p52) - 0x1.8p52 = round(x)

as a particular example. Is this stable, i.e., is it true that alg(x) ~ f(x+e) for small e? It’s not, for example for x = 0.6, we get f(0.6) = 0.6, but alg(0.6) = 1.0 = f(0.6+0.4), so e = 0.4, which is not small. So the transformation x → (x+0x1.8p52) - 0x1.8p52 is not allowed per rule 1.

So it seems things can get well defined if needed. In a similar way, we can analyze all transformations from the page FloatingPointMath - GCC Wiki, but I don’t see any, that would break the now well defined rule of assuming alg(x) = f(x+e) for small “e”, does the transformation alg(x) → alg2(x) break the rule alg2(x) = f(x+e) for small e?

If you find a transformation that breaks this, let me know. I would like to know if I am making a mistake in the reasoning.

So the point of the rules I wrote down is to check both user’s code, but also compilers to ensure they actually obey the rules.

I am afraid I lost track of the objective of this discussion. Are you trying to formalize a set of rules different from IEEE that are somehow compiler- and architecture-independent? This might take more work than a ~10 bullet points. Or proposing that Julia implements an alternative set of rules which are similar to --ffast-math in spirit, but formalized by a standard?

1 Like

Considering that you are just now learning about conditioning and stability, and are still struggling with definitions, perhaps you shouldn’t be so glib about whether the analysis is easy to automate.

3 Likes

@Tamas_Papp:

Are you trying to formalize a set of rules different from IEEE that are somehow compiler- and architecture-independent? This might take more work than a ~10 bullet points.

Yes.

Or proposing that Julia implements an alternative set of rules which are similar to --ffast-math in spirit, but formalized by a standard?

No. You do what you think is best for Julia. However, I do ask the Julia community to either:

  • agree that the rules are consistent and coherent

or

  • find a counter example that cannot be refuted

In the first case I would then like to propose to put the -ffast-math Fortran benchmark as a separate example on the Julia benchmark’s page. In the second case I will apologize for being wrong, and post a blog post showing why such rules are fundamentally inconsistent and incoherent, which would be a valuable lesson.

@stevengj I did learn about this at school, but that was over 10 years now and I haven’t used it since in practice. I apologize for being sloppy about it. If you have an actual argument where my rules or reasoning is wrong, please let me know. When I said “easy to analyze”, I didn’t mean automatic analysis, but a paper + pencil analysis as I tried to show on a particular example.

How can you possible say that statements like

the compiler can rearrange the order of operations arbitrarily, as long as the changes are done locally (on the order of 10 elements), as that does not change the accuracy significantly. It cannot rearrange instructions globally (more than ~1000 operations apart), as that could potentially change the conditioning significantly.

are “consistent and coherent”.

These “guidelines” are so vague and open to interpretation that you can just bend them to fit the result of the examples you try them on. How should a compiler writer implement rules that are written this way? Does the compilation step involving calling you and asking you to judge whether an optimization is valid?

2 Likes

I think your “rule” is basically that the compiler cannot increase the error significantly, so by definition any counter-example would violate your “rule”…your argument is basically tautology. My point is that it is incredibly difficult to automate such a “rule” when analyzing arbitrary algebraic transformations, so it cannot be used to implement a working compiler optimization.

(There are academic papers on automating error analysis for compiler optimizations, but they only handle a very restricted class of problems/optimizations.)

In the first case I would then like to propose to put the -ffast-math Fortran benchmark as a separate example on the Julia benchmark’s page.

-ffast-math is unrelated to your “Fortran rules”. No compiler specification provides the guarantees that you are claiming. So, your whole document is irrelevant to that discussion.

As I commented elsewhere, if -ffast-math dramatically speeds up some code, we can ask what algebraic transformation it did that helped. We can then decide whether to do that transformation manually (in all languages).

But even if an algebraic transformation speeds up a benchmark, it doesn’t automatically mean it is worthwhile — the point of the benchmark is to measure how the same computations compare in different languages, not to compute e.g. π²/6 as quickly as possible. Benchmarks, almost by definition, are doing useless calculations to which you already know the answer.

4 Likes

I think that a blog post (or a series) by someone qualified about these issues would really help. Despite reading Goldberg (1991) and a couple of similar papers and book chapters, my standard method for learning about these issues is getting burned by them.

3 Likes

It’s a complicated subject, unfortunately, that is difficult to convey accurately in a few words. I really enjoyed this presentation by Kahan, which is nominally about Java but in fact is mainly about common misconceptions about floating-point arithmetic, with lots of interesting examples.

2 Likes