More accurate evalpoly

In my quest for a better exp function, I discovered that having a more accurate evalpoly function is necessary if you want low error. Our builtin version is very fast (1 flop per degree), but has an unfortunately high max error (near 1 ULP). Below is a version that uses extended precision to get below .5ULP (at least when the higher order terms tend to zero). The biggest downside is it is 6x slower, but this is still faster than evaluation with higher precision arithmetic from my testing.
If this should go in a package DoubleFloats? I’d be happy to make a PR.

@inline function exthorner(x, p::Tuple)
    hi, lo = p[end], zero(x)
    for i in length(p)-1:-1:1
        pi = p[i]
        prod = hi*x
        err = fma(hi, x, -prod)
        hi = pi+prod
        lo = fma(lo, x, prod - (hi - pi) + err)
    end
    return hi, lo
end
6 Likes

I should probably mention that this isn’t fully at double float precision. By making some reasonable assumptions about what the errors and polynomial sizes will be, you get more than native precision (my goal) while being about 2x faster than a fully correctly double float accurate implementation.

<2x slower for me with a degree 6 test polynomial (7 coefficients).

julia> function evalpolyloop!(f, y, x)
           @inbounds for i ∈ eachindex(y,x)
               y[i] = f(x[i], (0.6666666666667333541, 0.3999999999635251990, 0.2857142932794299317, 0.2222214519839380009, 0.1818605932937785996, 0.1525629051003428716, 0.1532076988502701353))
           end
       end
evalpolyloop! (generic function with 1 method)

julia> x = rand(256) .+ 0.5; y = similar(x);

julia> @inline function exthorner(x, p::Tuple)
           hi, lo = p[end], zero(x)
           for i in length(p)-1:-1:1
               pi = p[i]
               prod = hi*x
               err = fma(hi, x, -prod)
               hi = pi+prod
               lo = fma(lo, x, prod - (hi - pi) + err)
           end
           return hi, lo
       end
exthorner (generic function with 1 method)

julia> @benchmark evalpolyloop!(evalpoly, $y, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     37.663 ns (0.00% GC)
  median time:      37.959 ns (0.00% GC)
  mean time:        38.003 ns (0.00% GC)
  maximum time:     90.504 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

julia> @benchmark evalpolyloop!((a,b) -> first(exthorner(a,b)), $y, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     63.381 ns (0.00% GC)
  median time:      63.487 ns (0.00% GC)
  mean time:        63.574 ns (0.00% GC)
  maximum time:     99.733 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     980
5 Likes

As far as I can tell, you almost implemented a compensated Horner algorithm, only with the notable difference that the error term prod - (hi - pi) is only guaranteed to be exact if abs(prod) < abs(pi). I’m not sure whether this is the assumption you were thinking about but if so, then under this assumption you have a compensated Horner algorithm which should be as accurate as a standard Horner scheme using twice the native precision:

If this assumption does not hold in practice, maybe we should try and measure how much performance would be lost if we wanted to accurately compute that error term (i.e. the one related to the hi = pi+prod operation) in all cases. Or is that what you already tested and refer to as a “fully correctly double float accurate implementation”?

5 Likes

https://cadxfem.org/cao/Compensation-horner.pdf was the reference I was using.
The assumption I made was that the higher order terms are smaller than the current polynomial coeficient, ie |p[i]|>= |evalpoly(x, p[i+1:end])|. This is true for |x|<1 and monotonically decreasing |p[i]|. The good news is that that is the scenario that you encounter for evaluating special functions.

2 Likes

The fully accurate version is

@inline function two_sum(a, b)
    hi = a + b
    a1 = hi - b
    b1 = hi - a1
    lo = (a - a1) + (b - b1)
    return hi, lo
end

@inline function exthorner(x, p::Tuple)
    hi, lo = p[end], zero(x)
    for i in length(p)-1:-1:1
        pi = p[i]
        prod = hi*x
        err1 = fma(hi, x, -prod)
        hi,err2 = two_sum(pi,prod)
        lo = fma(lo, x, err1 + err2)
    end
    return hi, lo
end

It should be about another 2x slowdown, but will retain precision for polynomials where high order terms are bigger.

1 Like

OK. Then we’re speaking of exactly the same assumption :slight_smile:

OK, then I entirely follow you! And with the “not fully compensated” version, don’t you observe a doubling of the accuracy in practice?

1 Like

I haven’t tested as extensively as I would want to to advertise that, but I definitely am seeing much better than .5 ULP which was all I really needed.

1 Like

I see. I think I have some code somewhere generating ill-conditioned polynomial evaluations. If I have some time and I manage to find it, I’ll try to test your version more extensively.

2 Likes