Fitting a polynomial of degree 10 to data

Hi there :)

I’m trying to test Julia’s numerical accuracy by fitting a polynomial of degree 10 to my data. The data can be found at the following link: https://www.itl.nist.gov/div898/strd/lls/data/LINKS/DATA/Filip.dat

I want to find the coefficients that best fit the model
y = \sum_{j=0}^{10} \beta_j x^j + \varepsilon

I loaded my data and named it filip, and I have a function called poly that creates a polynomial of a given degree.
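
For reference, a minimal sketch of reading the file, assuming it has been downloaded locally (the path is hypothetical; the StRD .dat files put y in column 1 and x in column 2 after a roughly 60-line descriptive header):

using DelimitedFiles, DataFrames

raw = readdlm("Filip.dat"; skipstart=60)          # skip the StRD header block
filip = DataFrame(y = raw[:, 1], x = raw[:, 2])   # 82 observations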

So far I have tried:

  1. fit(LinearModel, @formula(y ~ 1 + poly(x,10)), filip)

  2. fm = @formula(y ~ 1 + poly(x, 10)) ; filip_model = glm(fm, filip, Normal())

  3. fm = @formula(y ~ 1 + poly(x, 10)); filip_model = lm(fm, filip)

The thing is, I already have the certified answer to the problem, haha; it is given at this link: StRD Certified Values for Dataset Filip

However, I want to know how well Julia reproduces the decimals in the coefficients.
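
One way to quantify that, following NIST’s own convention, is the log relative error (LRE), i.e. the number of agreeing significant digits. A minimal sketch, assuming β̂ holds the fitted coefficients and βcert the certified values from the NIST page:

lre(est, cert) = -log10(abs(est - cert) / abs(cert))  # digits of agreement
digits = lre.(β̂, βcert)                               # one LRE per coefficient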

Thank you in advance :)

Those are not base functions; which package are you using?

I’m assuming OP is using GLM with the additional poly function described in the docs here.

Do I understand the question correctly as being about whether GLM.jl can replicate the coefficient values given on the certification page you are linking? If so, it appears it can’t:

julia> lm(@formula(y ~ x + x^2 + x^3 + x^4 + x^5 + x^6 + x^7 + x^8 + x^9 + x^10), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x + :(x ^ 2) + :(x ^ 3) + :(x ^ 4) + :(x ^ 5) + :(x ^ 6) + :(x ^ 7) + :(x ^ 8) + :(x ^ 9) + :(x ^ 10)

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
                   Coef.     Std. Error       t  Pr(>|t|)      Lower 95%      Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept)  0.0          NaN            NaN       NaN     NaN            NaN
x            0.0          NaN            NaN       NaN     NaN            NaN
x ^ 2        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 3        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 4        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 5        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 6        0.00368644     0.000219028   16.83    <1e-26    0.0032503      0.00412258
x ^ 7        0.00191731     0.000127438   15.05    <1e-23    0.00166355     0.00217107
x ^ 8        0.000375885    2.75293e-5    13.65    <1e-21    0.000321067    0.000430703
x ^ 9        3.28191e-5     2.61667e-6    12.54    <1e-19    2.76087e-5     3.80296e-5
x ^ 10       1.07467e-6     9.23624e-8    11.64    <1e-17    8.90753e-7     1.25859e-6
───────────────────────────────────────────────────────────────────────────────────────

Feeding the same data into R, I get:

Call:
lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + 
    I(x^7) + I(x^8) + I(x^9) + I(x^10), data = df)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0099087 -0.0024610  0.0003385  0.0020743  0.0071654 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.743e+02  8.756e+01  -1.990 0.050345 .  
x           -3.269e+02  1.480e+02  -2.208 0.030436 *  
I(x^2)      -2.661e+02  1.095e+02  -2.429 0.017617 *  
I(x^3)      -1.239e+02  4.652e+01  -2.664 0.009534 ** 
I(x^4)      -3.638e+01  1.251e+01  -2.907 0.004845 ** 
I(x^5)      -6.979e+00  2.211e+00  -3.156 0.002333 ** 
I(x^6)      -8.747e-01  2.567e-01  -3.407 0.001079 ** 
I(x^7)      -6.906e-02  1.890e-02  -3.654 0.000487 ***
I(x^8)      -3.118e-03  8.009e-04  -3.894 0.000219 ***
I(x^9)      -6.139e-05  1.489e-05  -4.123 9.91e-05 ***
I(x^10)             NA         NA      NA       NA    
---
Signif. codes:  0 β€˜***’ 0.001 β€˜**’ 0.01 β€˜*’ 0.05 β€˜.’ 0.1 β€˜ ’ 1

Residual standard error: 0.003768 on 72 degrees of freedom
Multiple R-squared:  0.9958,	Adjusted R-squared:  0.9953 
F-statistic:  1895 on 9 and 72 DF,  p-value: < 2.2e-16

which seems somewhat closer but not quite there.

Maybe the problem is nearly rank-deficient, in which case GLM might behave differently from other software; there isn’t necessarily a “correct” approach to this issue always and everywhere. In this case, though, it appears there is an “official” correct solution, so maybe @nalimilan is interested in having a look at what’s happening here.

GLM’s Cholesky call has a somewhat lower tolerance than R’s. We need to add an option so users can change the tolerance themselves in the lm call.
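
In the meantime, a minimal workaround sketch, assuming x and y hold the Filip data as Float64 vectors: Julia’s backslash solves rectangular least-squares problems with a pivoted QR factorization rather than Cholesky, which copes better with the near rank deficiency.

X = [xi^j for xi in x, j in 0:10]  # 82×11 Vandermonde-style design matrix
β̂ = X \ y                          # least-squares solve via pivoted QR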

Fitting polynomials by solving for the coefficients of monomials is generally not a good idea. It produces poorly conditioned matrices, and thus large errors in the coefficients, because the shapes of the curves x^n are very similar. It’s better to express the polynomial as a linear combination of Chebyshev or Legendre polynomials, which are orthogonal to each other (with respect to an inner product over a real interval with a suitable weight function, and at least strongly linearly independent when discretized).
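
A minimal sketch of this with Polynomials.jl, assuming x and y hold the Filip data; the rescaling of x to [-1, 1] is what makes the Chebyshev basis well conditioned:

using Polynomials

t = 2 .* (x .- minimum(x)) ./ (maximum(x) - minimum(x)) .- 1  # rescale to [-1, 1]
c = fit(ChebyshevT, t, y, 10)  # least-squares fit in the Chebyshev basis
p = convert(Polynomial, c)     # monomial coefficients in t, if needed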

This isn’t an esoteric problem. If you’re having trouble getting good fits to a 10th-order polynomial, this is likely your problem.

Not much point continuing the discussion without OP, but if I understand the post and the linked page correctly, this is a deliberately unstable problem used to validate statistical software against a known result. Of course, I don’t know how that known result was derived or whether it’s in any way better than what R or GLM.jl give, but I think there’s little point arguing with the problem formulation.

Fitting using Polynomials.jl is easy to do. With x, y as the data and as as the certified coefficients from the link:

using Polynomials
p = fit(x, y, 10)               # least-squares fit in the monomial basis
pa = fit(ArnoldiFit, x, y, 10)  # Vandermonde-with-Arnoldi fit (better conditioned)
q = Polynomial(as)              # polynomial built from the certified coefficients

Then

julia> p - q
Polynomial(-4.5882096856075805e-5 - 8.77541174304497e-5*x - 7.449668601111625e-5*x^2 - 3.696358521665388e-5*x^3 - 1.187106710176522e-5*x^4 - 2.57854703988869e-6*x^5 - 3.836783157851187e-7*x^6 - 3.8622812281730035e-8*x^7 - 2.5178570339789985e-9*x^8 - 9.601549980434165e-11*x^9 - 1.6269660925026692e-12*x^10)

which are within the given bounds, at first glance. The ArnoldiFit is a better match to the certified coefficients, though:

julia> maximum(abs(pa(x) - q(x)) for x ∈ range(0,1, length=1000))
7.275957614183426e-12

julia> maximum(abs(p(x) - q(x)) for x ∈ range(0,1, length=1000))
0.00025997101602115436
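
As a follow-up, if you want the monomial coefficients of the Arnoldi fit for a direct comparison, the Polynomials.jl docs note that it can be converted back to the standard basis:

convert(Polynomial, pa)   # ArnoldiFit expressed in the monomial basis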

This worked!

The only thing I changed was that I had to use p = Polynomials.fit(x, y, 10) and then q = Polynomial(as). When I ran maximum(abs(p(x) - q(x)) for x ∈ range(0, 1, length=1000)), the result was 0.0003723332774825394.

I can’t tell you how grateful I am! Thank you so, so much :)

But they are very similar coefficients. I still have to figure out the standard errors of the estimates.
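
For reference, a minimal sketch of the textbook OLS standard errors, assuming X is the design matrix and β̂ the fitted coefficients from the monomial-basis sketch above (note that X'X is severely ill-conditioned here, so a QR-based route is safer in practice than forming inv(X'X)):

using LinearAlgebra

r  = y - X * β̂                              # residuals
s² = dot(r, r) / (length(y) - size(X, 2))   # residual variance
se = sqrt.(s² .* diag(inv(X' * X)))         # coefficient standard errors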

Sorry, my test seems to be a bad one; I will delete the post.

NB: [picture attached, illustrating the problem solved by @j_verzani]
