# Rounding the coefficients of a minimax polynomial

I’m making a Julia package for automating implementation of univariate real functions using polynomial approximation, but I’m having trouble with rounding the coefficients after a minimax polynomial is found, while preserving accuracy.

An example, some (close-to) minimax polynomial has max relative error of 3e-18, but then it takes much time to find a good rounding of its coefficients to Float64, and the max error of the Float64 polynomial turns out to be, e.g. 1.9e-16, while I’d prefer something closer to 1e-16 (almost correctly rounded result).

@Oscar_Smith in the PR where you added sinpi and cospi kernels for Float64 and Float32 arithmetics to Base, you mentioned that you did “lots of messing around with rounding to get things exact”, which, I think, referred to converting the BigFloat coefficients of the minimax polynomial to machine-precision coefficients.

Could you expand on that? Do you use Sollya’s fpminimax command? Or perhaps you used some Julia implementation of LLL (I think that fpminimax uses that lattice reduction stuff)? How long did it take for a good rounding to be found?

Fpminimax docs, for reference: Sollya User's Manual

My current approach (not very effective) is sampling numbers like this: rand(-31:31) and applying them to the coefficients with nextfloat.

My solution so far has been just rounding which isn’t perfect but works decently a lot of the time. All my coefficients could probably be further optimized.

Edit I accidentally edited this so now the conversation doesn’t make a ton of sense

1 Like

I thought Remez.jl just finds the minimax polynomial for you (with BigFloat coeffs)? What I’m having trouble with is converting the coefficients to Float64.

My solution so far has been just rounding which isn’t perfect but works decently a lot of the time. All my coefficients could probably be further optimized.

1 Like

OK, thank you! For me this means that I need to think about the evaluation schemes some more, apart from trying to find good roundings.

Don’t express polynomials as coefficients in the monomial basis \{1,x,x^2,x^3,\ldots\}, which is notorious for numerical instabilities at high degrees.

I would suggest expressing your polynomials in the basis of Chebyshev polynomials T_n(x) (after rescaling your x interval to [-1,1]). These are much better behaved because they are “cosines in disguise”, bounded between -1 and +1: if you perturb a Chebyshev coefficient by \epsilon, the polynomial is perturbed by at most |\epsilon| on x \in [-1,1].

They have lots of other nice properties, e.g. they can be evaluated quickly by a Clenshaw recurrence, they are exponentially convergent for smooth functions, and simply interpolating through the Chebyshev points gives you nearly the minimax polynomial (and in O(n \log n) operations via an FFT).

The ApproxFun.jl package is built on Chebyshev polynomials and related functions and provides a huge range of functionality beyond simply polynomial approximation, from root finding to solving PDEs. For the basid task of polynomial approximation (in one or many dimensions), another alternative is my FastChebInterp.jl package. If you need the “exact” minimax polynomial, you may want to build a Remez algorithm on top of one of these packages.

3 Likes

IMO this often isn’t a great solution because clenshaw is a decent amount slower than horner.

1 Like

Yeah, I’m actually investigating an evaluation scheme where I evaluate a factored form of the polynomial. This has potential to be faster than Clenshaw/Horner because of exploiting CPU out-of-order execution, because of less data dependencies, and also should be better for implementing with SIMD, I guess.

Except that I need to translate the polynomial by some good constant before factoring it, which seems to help with achieving good accuracy.

My package actually currently finds the minimax (or close to minimax) polynomial using linear programming instead of Remez. I’m not sure how good a choice that was yet, but it seems to be an OK option. It certainly seems to be more flexible than Remez in some ways.

FastChebInterp’s Clenshaw implementation is within a factor of 2 of evalpoly(x, coefs) (Horner) on my machine (where coefs is an array, not a tuple) for a degree of 20. And that’s without any particular effort to use SIMD.

(This is for a runtime array of coefficients. Static tuples of coefficients, where the whole polynomial evaluation is unrolled and inlined as in special-function implementations, are a separate ballgame.)

And of course the point is that if you try monomials/Horner up to degree 20 or higher it can quickly become a numerical disaster for many functions, so the performance is irrelevant. If you have a degree \lesssim 10, in contrast, then it could definitely make sense to convert back to monomial form to try to squeeze out a few more cycles.

2 Likes

Just by sampling at a fixed grid of points? Yes, that’s a lot easier algorithm though it is not exactly minimax (and you could always devise a function to defeat it); the whole complication of Remez is to figure out the sampling points over a continuous interval.

Yeah, but I think it may still be OK because I randomly perturb the points on the grid before doing LP. So if it fails to find a good enough polynomial, the user will always have the option of increasing the number of points or even just trying again with the same number of points.

1 Like

Was interested in testing this case (hoping I didn’t mess up the Clenshaw algorithm) where everything was unrolled with constant coefficients.

Clenshaw code and constants
function clenshaw_chebyshev(x, c)
x2 = 2x
c0 = c[end-1]
c1 = c[end]
for i in length(c)-2:-1:1
c0, c1 = c[i] - c1, muladd(c1, x2, c0)
end
end

const n5 = (-1.0, 0.5, -0.3333333333333333, 0.25, -0.2)

const n10 = (-1.0, 0.5, -0.3333333333333333, 0.25, -0.2, 0.16666666666666666, -0.14285714285714285, 0.125, -0.1111111111111111, 0.1)

const n25 = (-1.0, 0.5, -0.3333333333333333, 0.25, -0.2, 0.16666666666666666, -0.14285714285714285, 0.125, -0.1111111111111111, 0.1, -0.09090909090909091, 0.08333333333333333, -0.07692307692307693, 0.07142857142857142, -0.06666666666666667, 0.0625, -0.058823529411764705, 0.05555555555555555, -0.05263157894736842, 0.05, -0.047619047619047616, 0.045454545454545456, -0.043478260869565216, 0.041666666666666664, -0.04)

const n50 = (-1.0, 0.5, -0.3333333333333333, 0.25, -0.2, 0.16666666666666666, -0.14285714285714285, 0.125, -0.1111111111111111, 0.1, -0.09090909090909091, 0.08333333333333333, -0.07692307692307693, 0.07142857142857142, -0.06666666666666667, 0.0625, -0.058823529411764705, 0.05555555555555555, -0.05263157894736842, 0.05, -0.047619047619047616, 0.045454545454545456, -0.043478260869565216, 0.041666666666666664, -0.04, 0.038461538461538464, -0.037037037037037035, 0.03571428571428571, -0.034482758620689655, 0.03333333333333333, -0.03225806451612903, 0.03125, -0.030303030303030304, 0.029411764705882353, -0.02857142857142857, 0.027777777777777776, -0.02702702702702703, 0.02631578947368421, -0.02564102564102564, 0.025, -0.024390243902439025, 0.023809523809523808, -0.023255813953488372, 0.022727272727272728, -0.022222222222222223, 0.021739130434782608, -0.02127659574468085, 0.020833333333333332, -0.02040816326530612, 0.02)

Benchmark summary
# 5 degree
julia> @btime clenshaw_chebyshev(x, n5) setup=(x=rand())
2.355 ns (0 allocations: 0 bytes)
julia> @btime evalpoly(x, n5) setup=(x=rand())
2.353 ns (0 allocations: 0 bytes)

# 10 degree
julia> @btime clenshaw_chebyshev(x, n10) setup=(x=rand())
3.329 ns (0 allocations: 0 bytes)
julia> @btime evalpoly(x, n10) setup=(x=rand())
3.281 ns (0 allocations: 0 bytes)

# 25 degree
julia> @btime clenshaw_chebyshev(x, n25) setup=(x=rand())
10.777 ns (0 allocations: 0 bytes)
julia> @btime evalpoly(x, n25) setup=(x=rand())
8.826 ns (0 allocations: 0 bytes)

# 50 degree
julia> @btime clenshaw_chebyshev(x, n50) setup=(x=rand())
28.561 ns (0 allocations: 0 bytes)
julia> @btime evalpoly(x, n50) setup=(x=rand())
23.273 ns (0 allocations: 0 bytes)


So for degrees <10 Clenshaw is just a bit slower (probably need a better benchmark) and for larger degree polynomial the Clenshaw algorithm was about \approx 23\% slower than Horner scheme using evalpoly which is actually more comparable than I originally thought.

For high degree polynomials (>10), it might be helpful to implement the SIMD variant of the evalpoly function (which could be similarly employed for Clenshaw)

SIMD 2nd degree Horner

@inline function horner_degree2(x, P)
N = length(P)
xx = x*x

out = P[end]
out2 = P[end-1]

for i in N-2:-2:2
end
if iszero(rem(N, 2))
else
end
end

julia> @btime horner_degree2(x, n5) setup=(x=rand())
2.581 ns (0 allocations: 0 bytes)

julia> @btime horner_degree2(x, n10) setup=(x=rand())
2.585 ns (0 allocations: 0 bytes)

julia> @btime horner_degree2(x, n25) setup=(x=rand())
5.258 ns (0 allocations: 0 bytes)

julia> @btime horner_degree2(x, n50) setup=(x=rand())
13.268 ns (0 allocations: 0 bytes)


So the 50 degree polynomial using the SIMD Horner scheme (unrolling factor of 2) is 43 \% faster achieving close to the optimal 2x speedup.

I think it should be said that the SIMD variants will produce slightly different results compared to the regular Horner’s scheme (\approx 0.5 ULP) with second order schemes producing slightly higher errors between \approx 0.5 - 1.5 ULP). There are a couple instances where the second order scheme might be more accurate in alternating series but that depends on the coefficients. Though this can be tested if coefficients are known.

I’m not for sure it’s always optimal to use the SIMD implementation as it depends on the application.
Are we evaluating the same polynomial a million times (might be best to SIMD individual evaluations)?
Or are we evaluating a high degree polynomial just a couple of times (probably best to SIMD the polynomial evaluation)?

This also comes up in higher order Chebyshev basis like approximating 2D functions where it might be best to SIMD the evaluations in the second dimension… (should probably make sure this is happening in some of my code…)

2 Likes

Yes, it would be nice to accelerate FastChebInterp.jl for this (since the main reason I write FastChebInterp was to go to higher dimensions, whereas I couldn’t use ApproxFun in more than 2d).

1 Like

@Oscar_Smith I wonder how did you come by the coefficient 1.2245907532225998e-16 here? I’d expect it to be the result of Float64(BigFloat(pi) - Float64(pi)), but that doesn’t seem to be the case, for any BigFloat precision.

I believe it’s a minimax coeficient so it’s the low bits of the first coeficient in Faster and more accurate sinpi, cospi, sincospi for Float32, Float64 by oscardssmith · Pull Request #41744 · JuliaLang/julia · GitHub

1 Like

Oh so it’s Float64(first_minimax_coefficient - Float64(first_minimax_coefficient)). Thanks!

1 Like

@Oscar_Smith hope I’m not bothering you too much, but I’m trying to reverse engineer your construction of Base.Math.sinpi_kernel(::Float64) and I’m having trouble with understanding something, so I hope you could shed some more insight.

Consider these two functions:

function sinpi_kernel(x::Float64)
x² = x*x
x⁴ = x²*x²
r  = evalpoly(x², (2.5501640398773415, -0.5992645293202981, 0.08214588658006512,
-7.370429884921779e-3, 4.662827319453555e-4, -2.1717412523382308e-5))
end

function sinpi_kernel_simpler(x::Float64)
x² = x*x
p = evalpoly(x², (3.141592653589793, -5.16771278004997,
2.5501640398773415, -0.5992645293202981, 0.08214588658006512,
-7.370429884921779e-3, 4.662827319453555e-4, -2.1717412523382308e-5))

return muladd(x, p, x * 1.2245907532225998e-16)
end


sinpi_kernel is exactly copied from your implementation in Base, and sinpi_kernel_simpler is a slight simplification of sinpi_kernel. I was hoping that the two will have very similar accuracy, given that:

1. both functions seem to use FMA to the fullest so as to avoid catastrophic cancellation
2. both functions include both the high and the low parts of the split first coefficient of the minimax polynomial so as to lessen the error caused by representing the coefficient in Float64

However, your sin_kernel has significantly better maximum error over the interval [0, 0.25]:

sinpi_kernel relative error: 1.224929786588653e-16 = 0.5516593330435678 ulp
sinpi_kernel_simpler relative error: 1.9164026353462354e-16 = 0.8630710194437142 ulp


A naive implementation that doesn’t use the low part of the first minimax coefficient actually has max error of 1.04081 ulp, so sinpi_kernel_simpler is actually closer in that respect to the naive implementation than to your sinpi_kernel.

Do you perhaps have any idea what’s the reason for sinpi_kernel’s superior accuracy? In particular, I’m wondering whether your evaluation scheme is more accurate in general, or if which scheme is better depends on the values of the polynomial coefficients.

The analysis here is pretty tricky but essentially what you need to do is keep track of the size of all of your components through the algorithm. Specifically, sinpi_kernel_simpler has the problem that the last it isn’t compensating anything. p will have a size of roughly pi so eps(p) is on the order of 4e-16. As such the 1e-16 correction won’t do anything. You’ve already lost the precision. The way sinpi_kernel works, x⁴*r<=.0994 so the 1e-16 term can still have an effect, and then that whole muladd is put within another one which has a size of roughly 0.32.

The TLDR is that you need to analyze at each step what the size of the output is because that size will determine the size of the error.

2 Likes