Fitting a polynomial of degree 10 to data

valperez · November 4, 2021, 2:16am

Hi there

I’m trying to test Julia’s accuracy so I’m trying to fit a polynomial of degree 10 to my data. The data can be found in the following link: https://www.itl.nist.gov/div898/strd/lls/data/LINKS/DATA/Filip.dat

I want to find the coefficients that best fit the problem
y = \sum_{j = 0}^{j= 10 } beta_j * x^j + epsilon

I uploaded and named my data filip and I have a function called “poly” that creates a polynomial a certain degree.

So far I have tried:

fit(LinearModel, @formula(y ~ 1 + poly(x,10)), filip)
fm = @formula(y ~ 1 + poly(x, 10)) ; filip_model = glm(fm, filip, Normal())
fm = @formula(y ~ 1 + poly(x, 10)); filip_model = lm(fm, filip)

The problem is that I have the answer to the problem jaja is in this link: StRD Certified Values for Dataset Filip

However, I want to know how well julia gets the decimals in the coeficients.

Thank you in advance

gustaphe · November 4, 2021, 4:23am

Those are not base functions, which package are you using?

nilshg · November 4, 2021, 12:16pm

I’m assuming OP is using GLM with the additional poly function described in the docs here.

Do I understand the question correctly as being about whether GLM.jl can replicate the coefficient values in given on the certification page you are linking? If so it appears it can’t:

julia> lm(@formula(y ~ x + x^2 + x^3 + x^4 + x^5 + x^6 + x^7 + x^8 + x^9 + x^10), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x + :(x ^ 2) + :(x ^ 3) + :(x ^ 4) + :(x ^ 5) + :(x ^ 6) + :(x ^ 7) + :(x ^ 8) + :(x ^ 9) + :(x ^ 10)

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
                   Coef.     Std. Error       t  Pr(>|t|)      Lower 95%      Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept)  0.0          NaN            NaN       NaN     NaN            NaN
x            0.0          NaN            NaN       NaN     NaN            NaN
x ^ 2        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 3        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 4        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 5        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 6        0.00368644     0.000219028   16.83    <1e-26    0.0032503      0.00412258
x ^ 7        0.00191731     0.000127438   15.05    <1e-23    0.00166355     0.00217107
x ^ 8        0.000375885    2.75293e-5    13.65    <1e-21    0.000321067    0.000430703
x ^ 9        3.28191e-5     2.61667e-6    12.54    <1e-19    2.76087e-5     3.80296e-5
x ^ 10       1.07467e-6     9.23624e-8    11.64    <1e-17    8.90753e-7     1.25859e-6
───────────────────────────────────────────────────────────────────────────────────────

feeding the same data into R I get:

Call:
lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + 
    I(x^7) + I(x^8) + I(x^9) + I(x^10), data = df)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0099087 -0.0024610  0.0003385  0.0020743  0.0071654 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.743e+02  8.756e+01  -1.990 0.050345 .  
x           -3.269e+02  1.480e+02  -2.208 0.030436 *  
I(x^2)      -2.661e+02  1.095e+02  -2.429 0.017617 *  
I(x^3)      -1.239e+02  4.652e+01  -2.664 0.009534 ** 
I(x^4)      -3.638e+01  1.251e+01  -2.907 0.004845 ** 
I(x^5)      -6.979e+00  2.211e+00  -3.156 0.002333 ** 
I(x^6)      -8.747e-01  2.567e-01  -3.407 0.001079 ** 
I(x^7)      -6.906e-02  1.890e-02  -3.654 0.000487 ***
I(x^8)      -3.118e-03  8.009e-04  -3.894 0.000219 ***
I(x^9)      -6.139e-05  1.489e-05  -4.123 9.91e-05 ***
I(x^10)             NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.003768 on 72 degrees of freedom
Multiple R-squared:  0.9958,	Adjusted R-squared:  0.9953 
F-statistic:  1895 on 9 and 72 DF,  p-value: < 2.2e-16

which seems somewhat closer but not quite there.

Maybe the problem is nearly rank-deficient, in which case GLM might behave differently from other software, but there might not necessarily be a “correct” approach to this issue always and everywhere. In this case it appears though that there is an “official” correct solution, maybe @nalimilan is interested in having a look at what’s happening here.

pdeffebach · November 4, 2021, 12:21pm

GLM’s cholesky call has a somewhat lower tolerance than R’s. We need to add an option so users can change tolerance themselves in the lm call.

John_Gibson · November 4, 2021, 1:25pm

Fitting polynomials by solving for the coefficients of monomials is generally not a good idea. It generates poorly conditioned matrices and thus large errors in coefficients because the shapes of the curves x^n are very similar. Instead it’s better to express the polynomial as a linear combination of Chebyshev polynomials or Legendre polynomials, which are orthogonal to each other (in an inner product over the real interval and with proper weighting function, and at least strongly linearly independent when discretized).

This isn’t an esoteric problem. If you’re having getting good fits to a 10th order polynomial with order 10 data points, it’s likely your problem.

nilshg · November 4, 2021, 2:11pm

Not much point continuing the discussion without OP, but if I understand his post and the linked page correctly, this is a deliberately unstable problem which is used to validate statistical software against some known result. Now of course I don’t know how that known result has been derived and whether it’s in any way better than what R or GLM.jl give, but I think there’s little point arguing with the problem formulation.

j_verzani · November 4, 2021, 4:24pm

Fitting using Polynomials is easy to do. I have x,y as variables and as as the given coefficients in the link:

using Polynomiales
p = fit(x,y, 10)
pa = fit(ArnoldiFit, x, y, 10)
q = Polynomial(as)

Then

julia> p - q
Polynomial(-4.5882096856075805e-5 - 8.77541174304497e-5*x - 7.449668601111625e-5*x^2 - 3.696358521665388e-5*x^3 - 1.187106710176522e-5*x^4 - 2.57854703988869e-6*x^5 - 3.836783157851187e-7*x^6 - 3.8622812281730035e-8*x^7 - 2.5178570339789985e-9*x^8 - 9.601549980434165e-11*x^9 - 1.6269660925026692e-12*x^10)

which are within the bounds given, on first glance. The ArnoldiFit is a better match to the given coefficients though:

julia> maximum(abs(pa(x) - q(x)) for x ∈ range(0,1, length=1000))
7.275957614183426e-12

julia> maximum(abs(p(x) - q(x)) for x ∈ range(0,1, length=1000))
0.00025997101602115436

valperez · November 4, 2021, 5:27pm

This worked!

The only thing I changed was that I had to use p = Polynomials.fit(x,y, 10) and then q = Polynomial(as) and when I did the maximum(abs(p(x) - q(x)) for x ∈ range(0,1, length=1000)), the result gave me 0.0003723332774825394

I can’t tell you how grateful I am! thank you so so so much

valperez · November 4, 2021, 6:17pm

But they are very similar coefficients. I still have to figure out the standar deviation of the estimations.

rafael.guerra · November 4, 2021, 6:27pm

Sorry my test seems to be bad one, will delete post.

NB:
Leave a picture just to illustrate the problem solved by @j_verzani:

Topic		Replies	Views
Problems with simple linear model (R and Julia comparison) General Usage	3	93	March 21, 2025
Polynomial problem-help me please New to Julia question , homework	8	589	October 19, 2023
GLM - inconsistency with R on ISLR dataset? Performance question , glm	2	555	January 30, 2022
Polyfit (in Polynomials.jl ?) not working in new Julia version? General Usage polynomial	17	1315	January 26, 2024
Overfitting polynomial. help New to Julia question , homework	3	325	October 19, 2023

Fitting a polynomial of degree 10 to data

Related topics