In my quest for a better exp function, I discovered that a more accurate evalpoly is necessary if you want low error. Our builtin version is very fast (1 flop per degree), but has an unfortunately high max error (near 1 ULP). Below is a version that uses extended precision to get below 0.5 ULP (at least when the higher-order terms tend to zero). The biggest downside is that it is about 6x slower, but in my testing this is still faster than evaluating with higher-precision arithmetic.
If this should go in a package (DoubleFloats?), I'd be happy to make a PR.
@inline function exthorner(x, p::Tuple)
    hi, lo = p[end], zero(x)
    for i in length(p)-1:-1:1
        pi = p[i]
        prod = hi*x
        err = fma(hi, x, -prod)                    # exact rounding error of hi*x
        hi = pi + prod
        lo = fma(lo, x, prod - (hi - pi) + err)    # propagate lo and absorb this step's error
    end
    return hi, lo
end
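For reference, here's how I'd call it (just a sketch with made-up, exp-like coefficients):

p = (1.0, 1.0, 0.5, 1/6, 1/24)   # made-up coefficients, roughly a truncated exp series
x = 0.1
hi, lo = exthorner(x, p)
hi + lo    # lo carries the rounding error accumulated along the Horner recurrence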
I should probably mention that this isn't fully at double-float precision. By making some reasonable assumptions about the errors and polynomial sizes, you get more than native precision (my goal) while being about 2x faster than a fully correct double-float accurate implementation.
As far as I can tell, you almost implemented a compensated Horner algorithm, with the notable difference that the error term prod - (hi - pi) is only guaranteed to be exact if abs(prod) < abs(pi). I'm not sure whether this is the assumption you were thinking about, but if so, then under this assumption you have a compensated Horner algorithm, which should be as accurate as a standard Horner scheme using twice the native precision: https://www.sciencedirect.com/science/article/pii/S0898122108001120
If this assumption does not hold in practice, maybe we should try and measure how much performance would be lost if we wanted to accurately compute that error term (i.e. the one related to the hi = pi + prod operation) in all cases. Or is that what you already tested and refer to as a "fully correct double-float accurate implementation"?
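To make the condition concrete, here's a small sketch (made-up values, not from your code) where the cheap error term breaks down:

# Made-up values: when abs(prod) >> abs(pi), the term prod - (hi - pi)
# no longer captures the rounding error of pi + prod
pi_, prod_ = 1.0, 2.0^60
hi_ = pi_ + prod_            # rounds to 2.0^60, dropping the 1.0 entirely
prod_ - (hi_ - pi_)          # cheap error term: 0.0
# the true rounding error of pi_ + prod_ is 1.0, which a full two-sum would recover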
https://cadxfem.org/cao/Compensation-horner.pdf was the reference I was using.
The assumption I made was that the higher-order terms are smaller than the current polynomial coefficient, i.e. |p[i]| >= |evalpoly(x, p[i+1:end])|. This is true for |x| < 1 and monotonically decreasing |p[i]|. The good news is that this is the scenario you encounter when evaluating special functions.
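Since prod = hi*x, the exactness condition from your reply (abs(prod) <= abs(pi)) amounts to |p[i]| >= |x * evalpoly(x, p[i+1:end])|, which can be checked directly. A quick made-up sanity check with the Taylor coefficients of exp (my own, not exhaustive):

# Made-up check of the loop condition |pi| >= |prod| at a particular x,
# using the monotonically decreasing coefficients 1/n!
p = ntuple(n -> 1/factorial(n - 1), 8)
x = 0.5
all(abs(p[i]) >= abs(x * evalpoly(x, p[i+1:end])) for i in 1:length(p)-1)   # true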
@inline function two_sum(a, b)
    # Knuth's TwoSum: hi + lo == a + b exactly, with no assumption on |a| vs |b|
    hi = a + b
    a1 = hi - b
    b1 = hi - a1
    lo = (a - a1) + (b - b1)
    return hi, lo
end
@inline function exthorner(x, p::Tuple)
    hi, lo = p[end], zero(x)
    for i in length(p)-1:-1:1
        pi = p[i]
        prod = hi*x
        err1 = fma(hi, x, -prod)       # exact rounding error of hi*x
        hi, err2 = two_sum(pi, prod)   # exact rounding error of pi + prod, no assumptions needed
        lo = fma(lo, x, err1 + err2)
    end
    return hi, lo
end
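A rough way to exercise it (my own sketch, not a benchmark): pick made-up coefficients where a higher-order term dominates and compare against a BigFloat reference:

# Made-up case: p[3] dominates the lower coefficients, so the earlier assumption fails
p = (1.0, 1e-8, 1e8, 2.0)
x = 0.75
hi, lo = exthorner(x, p)
ref = evalpoly(big(x), p)             # BigFloat reference for the same coefficients
Float64((big(hi) + lo) - ref)         # should be far below eps(hi)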
It should be about another 2x slowdown, but it will retain precision for polynomials where the higher-order terms are larger.
I haven't tested as extensively as I would want to before advertising that, but I am definitely seeing much better than 0.5 ULP, which was all I really needed.
I see. I think I have some code somewhere generating ill-conditioned polynomial evaluations. If I have some time and I manage to find it, I'll try to test your version more extensively.