Data approximation with a curve

I have a dataset with 130k points. I want to approximate them with some polynomial p (of degree at most 11, for example), so that later I could call y = p(x) for x ∈ [-90, 90] to get an approximated value. Cubic splines are also suitable.

Or to smooth the initial data into a single "black line", so that later I could find the nearest y value for a given x.

Among all the variety of packages, I failed to find a corresponding example.

What is the best solution for such problem?

Can you say why a curved line, and not, say, a straight line, is the right approximation? Does this mean that you have one-sided errors? What is the marginal distribution of the data, say at x = 50?

The approximation cannot be a linear function, due to the physical model of this experiment.
And I find it difficult to calculate the marginal distribution of this experimental data.

Data set near x = 50 is:

 (50.06190241859059, 0.5529411764705883)
 (50.06271206783095, 0.4411764705882353)
 (50.06287061681003, 0.5901960784313725)
 (50.063813056724726, 0.6254901960784314)
 (50.06462230244647, 0.6274509803921569)
 (50.065513860109654, 0.5392156862745098)
 (50.068873125880664, 0.6039215686274509)
 (50.07026625458975, 0.41568627450980394)
 (50.072331412425726, 0.46078431372549017)
 (50.07273857276312, 0.48235294117647054)
 (50.073249126363805, 0.6274509803921569)
 (50.07550475759751, 0.5098039215686274)
 (50.07586823841464, 0.5784313725490196)
 (50.081933993590376, 0.6235294117647059)
 (50.08232453457886, 0.5901960784313725)
 (50.08241485367218, 0.5686274509803921)
 (50.0895941522953, 0.5176470588235293)
 (50.093424325865975, 0.6215686274509804)
 (50.094941222608945, 0.6294117647058823)
 (50.09657226932971, 0.5980392156862745)
 (50.09698066420261, 0.603921568627451)
 (50.097922793054686, 0.588235294117647)
 (50.09912777308016, 0.6254901960784314)
 (50.10358675193727, 0.615686274509804)
 (50.10474992059791, 0.4235294117647059)
 (50.10542177083563, 0.596078431372549)
 (50.10566150221772, 0.5529411764705883)
 (50.107687002735915, 0.6176470588235294)
 (50.10977525190087, 0.5980392156862745)
 (50.110021374204145, 0.6176470588235294)
 (50.11086899429362, 0.5196078431372548)
 (50.11122246013789, 0.5078431372549019)
 (50.11188542972433, 0.603921568627451)
 (50.11368434011527, 0.40980392156862744)
 (50.11445099225463, 0.6215686274509804)
 (50.117902172346554, 0.5294117647058824)
 (50.11828276154756, 0.607843137254902)
 (50.12026213093516, 0.615686274509804)
 (50.12146711147272, 0.6058823529411765)
 (50.12255959984209, 0.484313725490196)
 (50.123404825843664, 0.5725490196078431)

If that is what you want to do (at least, that is what I gather from your post), you can do so with:

using Polynomials

data = [(50.06271206783095, 0.4411764705882353),
        (50.06287061681003, 0.5901960784313725),
        (50.063813056724726, 0.6254901960784314),
        (50.06462230244647, 0.6274509803921569)]  # ... the rest of the points

x = first.(data)
y = last.(data)
p = polyfit(x, y, 11)  # `fit(x, y, 11)` in newer versions of Polynomials.jl
p(50.0)                # evaluate the fitted polynomial

Polynomial fitting is quite dangerous without regularisation: you can end up with nasty oscillations and ill-defined behaviour.

You may want to look at ApproxFun, which uses either a Fourier or Chebyshev polynomial basis and avoids some of these problems. The package is really nice, and very powerful once you’ve created your approximations.
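As a minimal sketch of what a least-squares fit in a Chebyshev basis with ApproxFun could look like: the mock data, the interval [-90, 90], and the choice of 12 basis functions are all assumptions, not anything from the thread.

```julia
using ApproxFun, LinearAlgebra

# Mock data standing in for the real measurements (assumption).
x = collect(range(-90.0, 90.0; length=500))
y = sin.(0.05 .* x) .+ 0.1 .* randn(length(x))

S = Chebyshev(-90..90)  # Chebyshev basis on the data interval
n = 12                  # 12 basis functions ~ a degree-11 polynomial

# Evaluate each basis function at the data points, then solve the
# least-squares problem for the coefficients with `\`.
V = hcat((Fun(S, [zeros(k - 1); 1]).(x) for k in 1:n)...)
f = Fun(S, V \ y)

f(50.0)  # evaluate the smooth approximation anywhere in [-90, 90]
```

The `\` solve gives the coefficients minimising the squared residual over the scattered data points, so no interpolation grid is needed.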


Andrey has 130,000 data points on the line and 10 parameters, so maybe he can worry about regularisation another time. Looking at the skewed residuals, I am more concerned about whether a least-squares fit gives anything meaningful here.


I had a look at this package, but couldn’t find an appropriate example. It seems aimed more at solving PDEs than at my problem.

And yes, I hadn’t said so, but it is assumed that the “curve” fits the data well in some sense (least squares or some other regularisation).


Looks like I’ve succeeded with your solution.

  • orange curve is deg = 3
  • yellow curve is deg = 5
  • red curve is deg = 11
  • green curve is deg = 20


Nice, I do not know what goes wrong with higher orders. Maybe we are actually in @jarvist’s “dangerous” territory already. On the other hand, if I try this with mock data all seems fine (order 11):

@dlfivefifty @jarvist Thx, I managed to solve my problem with ApproxFun as well.



Oh wow, I did not notice you had 130,000 data points!
The physicist in me would recommend applying some kind of boxcar averaging or a similar smoothing function to get rid of the high-frequency noise before throwing the data into the linear-algebra curve-fitting cauldron.
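A boxcar average is only a few lines of base Julia. This is just a sketch; the window width w is a free parameter you would tune to your data.

```julia
using Statistics

# Boxcar (moving-average) smoother: replace each sample by the mean of
# the w-point window around it, shrinking the window near the edges.
function boxcar(y::AbstractVector, w::Int)
    n = length(y)
    h = w ÷ 2
    [mean(@view y[max(1, i - h):min(n, i + h)]) for i in 1:n]
end

boxcar([0.0, 0.0, 3.0, 0.0, 0.0], 3)  # → [0.0, 1.0, 1.0, 1.0, 0.0]
```

Since the x values here are unevenly spaced, a fixed point-count window is an approximation; a fixed-width window in x would be the more careful choice.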

Why should we filter out any noise? I thought that the more data we have, the more accurate the results we would obtain from the fit.

Least-squares methods try to minimise the square of the distance. If you have large noise, the noisiest parts of the signal will have the largest pull on the fit, and if several parameters are to be fitted, there can be large fluctuations in the fitted parameters.
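A toy illustration of that pull, using only base Julia’s `\` (which solves rectangular systems in the least-squares sense); the numbers are made up.

```julia
using LinearAlgebra

# Fit a straight line a + b*x to ten points lying exactly on y = x.
X = hcat(ones(10), 1.0:10.0)
y = collect(1.0:10.0)
β_clean = X \ y            # intercept ≈ 0, slope ≈ 1

# A single large noise spike drags both fitted parameters around.
y_noisy = copy(y)
y_noisy[5] += 20.0
β_noisy = X \ y_noisy      # both coefficients shift noticeably
```

Because the loss is squared, one point that is 20 units off contributes as much as 400 points that are 1 unit off, which is why heavy-tailed noise destabilises the fit.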

In my experience, the closer you start to the expected curve, the more robust your fit. Filtering and smoothing appropriately really helps.


My understanding is that an averaging-based smoothing procedure is roughly equivalent to using an L1 norm in your fitting. As Paulo says, this makes it more robust to extreme values.

Generally, smoothing is done to increase the signal-to-noise ratio. A Fourier transform of your data (the power spectrum) would show an enormous contribution from the high-frequency components. We know this is noise because it is unphysical (too much energy density).
If you smooth your data, you will also be able to plot the signal. Currently your plot just shows the extreme values within whatever interval ends up as one pixel on the x-axis.

Seeing that your data are periodic and look mostly sinusoidal, you almost certainly should be using a Fourier basis for the fit.
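A least-squares fit in a small trigonometric basis needs nothing beyond base Julia. In this sketch the period L, the number of harmonics K, and the mock data are all assumptions.

```julia
using LinearAlgebra

# Mock periodic data on [-90, 90] (assumption, not the real dataset).
x = collect(range(-90.0, 90.0; length=400))
y = 0.5 .+ 0.3 .* sin.(2π .* x ./ 180) .+ 0.05 .* randn(length(x))

L = 180.0   # assumed period of the signal
K = 5       # number of harmonics to keep

# Design matrix: constant column, then a cos/sin pair per harmonic.
A = hcat(ones(length(x)),
         (f.(2π .* k .* x ./ L) for k in 1:K for f in (cos, sin))...)
c = A \ y   # Fourier coefficients, in the least-squares sense

# Evaluate the fitted series anywhere in the interval.
fitted(t) = c[1] + sum(c[2k] * cos(2π * k * t / L) +
                       c[2k + 1] * sin(2π * k * t / L) for k in 1:K)
```

Keeping only a handful of harmonics acts as a built-in low-pass filter, so the high-frequency noise is discarded by the basis choice itself.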


It might be worth trying to plot a random sample (maybe ~1300 or so observations) in a point-plotting or scatterplot type mode and a small marker size.
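Drawing such a sample is one line with the Random standard library; `scatter` below stands for whatever plotting command you use (e.g. Plots.jl) and appears only as a comment, and the mock arrays are assumptions.

```julia
using Random

# Stand-ins for the real 130k-point dataset (assumption).
x = collect(range(-90.0, 90.0; length=130_000))
y = rand(130_000)

# 1300 distinct random indices, sorted so the subsample stays in x order.
idx = sort!(randperm(length(x))[1:1300])

# scatter(x[idx], y[idx], markersize=1)  # e.g. with Plots.jl
```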

+1 for the scatter plot suggestion. It often helps to understand your data better before you go crazy on curve fitting. Could you tell us what kind of data this is? The noise looks very peculiar to me. The upper and lower limits of the noise seem far too regular compared to the variance of the noise (or maybe I should say the amplitude of the high-frequency component, or whatever it is).

It is the dependence of the tangent angle on the signal intensity (acquired from an electron microscope).

Thx for the explanation.