Data approximation with a curve

Andrey.Borzunov · October 10, 2018, 4:46pm

I have a dataset with 130k points. I want to approximate them with some polynomial p (of no more than 11th degree, for example), so that later I can call y = p(x) for x ∈ [-90, 90] to get the approximated value. Cubic splines would also be suitable.

Alternatively, I could smooth the initial data into something like the black line in the plot, so that later I can look up the nearest y value for a given x.
[image: newplot (1)]

Despite the variety of packages available, I failed to find a corresponding example.

What is the best solution for such a problem?
Thx!

mschauer · October 10, 2018, 5:00pm

Can you say why a curved line, and not, say, a straight line, is the right approximation? Does this mean that you have one-sided errors? What is the marginal distribution of the data, say at x = 50?
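One rough way to look at the distribution of y near a given x is to select the points in a narrow window around it and summarize them. The sketch below does that with the standard library only; `window_summary` and its `halfwidth` parameter are hypothetical names, and the five sample points are taken from the data posted in this thread.

```julia
using Statistics

# Hypothetical helper: summarize y for all points with |x - x0| <= halfwidth.
function window_summary(x, y, x0; halfwidth = 0.1)
    ys = y[abs.(x .- x0) .<= halfwidth]
    isempty(ys) && error("no points in the window around x0")
    return (n = length(ys), mean = mean(ys), median = median(ys),
            q05 = quantile(ys, 0.05), q95 = quantile(ys, 0.95))
end

# Five of the points posted in this thread (near x = 50)
x = [50.06190241859059, 50.06271206783095, 50.06287061681003,
     50.063813056724726, 50.06462230244647]
y = [0.5529411764705883, 0.4411764705882353, 0.5901960784313725,
     0.6254901960784314, 0.6274509803921569]

s = window_summary(x, y, 50.0)
println(s)
```

A mean that sits clearly below the median, as it does for these five points, is a hint of skewed (one-sided) scatter, though five points are far too few to conclude anything.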

Andrey.Borzunov · October 10, 2018, 5:20pm

The approximation cannot be a linear function because of the physical model of this experiment.
And I am having difficulty calculating the marginal distribution of this experimental data.

Data set near x = 50 is:

 (50.06190241859059, 0.5529411764705883)
 (50.06271206783095, 0.4411764705882353)
 (50.06287061681003, 0.5901960784313725)
 (50.063813056724726, 0.6254901960784314)
 (50.06462230244647, 0.6274509803921569)
 (50.065513860109654, 0.5392156862745098)
 (50.068873125880664, 0.6039215686274509)
 (50.07026625458975, 0.41568627450980394)
 (50.072331412425726, 0.46078431372549017)
 (50.07273857276312, 0.48235294117647054)
 (50.073249126363805, 0.6274509803921569)
 (50.07550475759751, 0.5098039215686274)
 (50.07586823841464, 0.5784313725490196)
 (50.081933993590376, 0.6235294117647059)
 (50.08232453457886, 0.5901960784313725)
 (50.08241485367218, 0.5686274509803921)
 (50.0895941522953, 0.5176470588235293)
 (50.093424325865975, 0.6215686274509804)
 (50.094941222608945, 0.6294117647058823)
 (50.09657226932971, 0.5980392156862745)
 (50.09698066420261, 0.603921568627451)
 (50.097922793054686, 0.588235294117647)
 (50.09912777308016, 0.6254901960784314)
 (50.10358675193727, 0.615686274509804)
 (50.10474992059791, 0.4235294117647059)
 (50.10542177083563, 0.596078431372549)
 (50.10566150221772, 0.5529411764705883)
 (50.107687002735915, 0.6176470588235294)
 (50.10977525190087, 0.5980392156862745)
 (50.110021374204145, 0.6176470588235294)
 (50.11086899429362, 0.5196078431372548)
 (50.11122246013789, 0.5078431372549019)
 (50.11188542972433, 0.603921568627451)
 (50.11368434011527, 0.40980392156862744)
 (50.11445099225463, 0.6215686274509804)
 (50.117902172346554, 0.5294117647058824)
 (50.11828276154756, 0.607843137254902)
 (50.12026213093516, 0.615686274509804)
 (50.12146711147272, 0.6058823529411765)
 (50.12255959984209, 0.484313725490196)
 (50.123404825843664, 0.5725490196078431)

mschauer · October 10, 2018, 5:56pm

If you want to do polynomial regression (see Wikipedia), which is at least what I gather from your post, you can do so with https://github.com/JuliaMath/Polynomials.jl:

data = [...
        (50.06271206783095, 0.4411764705882353)
        (50.06287061681003, 0.5901960784313725)
        (50.063813056724726, 0.6254901960784314)
        (50.06462230244647, 0.6274509803921569)
         ...]

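using Polynomials
x = first.(data)        # x coordinates of the (x, y) tuples
y = last.(data)         # y coordinates
p = polyfit(x, y, 11)   # least-squares fit of degree 11 (newer releases of Polynomials.jl use fit(x, y, 11) instead)
p.(x)                   # evaluate the fitted polynomial at the data points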

jarvist · October 10, 2018, 8:34pm

Polynomial fitting is quite dangerous without regularisation - you can end up with nasty oscillations and ill defined behaviour.

You may want to look at ApproxFun, which uses either a Fourier or Chebyshev polynomial basis, and avoids some of the problems. The package is really nice! And very powerful once you’ve created your approximations.
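To illustrate why the choice of basis matters, here is a minimal sketch of a least-squares fit in a Chebyshev basis written with plain linear algebra. It is not ApproxFun's API: `chebvander` and `chebfit` are hypothetical helper names, the interval [-90, 90] is taken from the question, and the mock signal is made up for the example.

```julia
using LinearAlgebra

# Design matrix whose columns are T_0 .. T_deg evaluated at t in [-1, 1],
# built with the recurrence T_k = 2 t T_{k-1} - T_{k-2}.
function chebvander(t::AbstractVector, deg::Integer)
    V = ones(length(t), deg + 1)
    deg >= 1 && (V[:, 2] .= t)
    for k in 3:deg+1
        V[:, k] .= 2 .* t .* V[:, k-1] .- V[:, k-2]
    end
    return V
end

# Least-squares fit on [a, b]; returns a function u -> fitted value.
function chebfit(x, y, deg; a = -90.0, b = 90.0)
    scale(u) = (2 .* u .- (a + b)) ./ (b - a)   # map [a, b] onto [-1, 1]
    c = chebvander(scale(x), deg) \ y           # QR-based least squares
    return u -> dot(chebvander([scale(u)], deg)[1, :], c)
end

# Mock data (an assumption, not the data from this thread)
x = collect(range(-90, 90; length = 2001))
y = cosd.(x) .^ 2
p = chebfit(x, y, 11)
println(p(30.0))
```

Rescaling x to [-1, 1] before building the basis is the important part: monomials in raw x up to x^11 on [-90, 90] span many orders of magnitude, which makes the least-squares problem badly conditioned.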

mschauer · October 10, 2018, 9:11pm

Andrey has 130000 data points on the line and 10 parameters, so maybe he can worry about regularisation another time. Looking at the skewed residuals, I am more concerned about whether a least-squares fit gives anything meaningful here.

Andrey.Borzunov · October 11, 2018, 7:08am

I had a look at this package but couldn’t find an appropriate example. It seems better suited to solving a PDE than to my problem.

And yes, I hadn’t said so, but the “curve” is supposed to fit the data well in some sense (least squares or some other regularisation).

dlfivefifty · October 11, 2018, 7:26am

http://juliaapproximation.github.io/ApproxFun.jl/stable/faq.html#Approximating-functions-1

Andrey.Borzunov · October 11, 2018, 7:27am

Looks like I’ve succeeded with your solution.

orange curve is deg = 3
yellow curve is deg = 5
red curve is deg = 11
green curve is deg = 20

[image: newplot]

mschauer · October 11, 2018, 7:54am

Nice, I do not know what goes wrong with higher orders. Maybe we are actually in @jarvist’s “dangerous” territory already. On the other hand, if I try this with mock data all seems fine (order 11):

Andrey.Borzunov · October 11, 2018, 8:08am

@dlfivefifty @jarvist Thx, I also managed to solve my problem with ApproxFun.

[image: newplot (1)]

jarvist · October 11, 2018, 8:40am

Oh wow, I did not notice you had 130’000 data points!
The physicist in me would recommend applying some kind of boxcar averaging or similar smoothing function to get rid of the high frequency noise before throwing the data into the linear-algebra curve-fitting cauldron.
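A boxcar (moving) average can be sketched in a few lines. This version assumes the samples are already sorted by x and roughly evenly spaced; `boxcar` is a hypothetical helper, and truncating the window at the edges is just one possible choice.

```julia
# Centered moving average with an odd window length w.
# Near the edges the window is truncated rather than padded.
function boxcar(y::AbstractVector{<:Real}, w::Integer)
    isodd(w) || throw(ArgumentError("window length must be odd"))
    h = w ÷ 2
    n = length(y)
    c = cumsum(vcat(0.0, y))            # c[i + 1] == sum(y[1:i])
    out = similar(y, Float64)
    for i in 1:n
        lo = max(1, i - h)
        hi = min(n, i + h)
        out[i] = (c[hi + 1] - c[lo]) / (hi - lo + 1)
    end
    return out
end

smoothed = boxcar([0.0, 3.0, 0.0, 3.0, 0.0], 3)
println(smoothed)
```

With 130'000 points a window of a few hundred samples would still leave plenty of points for the fit, but the right width depends on how fast the underlying signal varies.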

Andrey.Borzunov · October 11, 2018, 9:03am

Why should we filter any noise? I thought that the more data we have, the more accurate results we will obtain during regularisation.

Paulo_Jabardo · October 11, 2018, 10:46am

Least squares methods try to minimize the square of the distance. If you have large noise, the noisiest parts of the signal will have the largest pull on the fit and if several parameters are to be fitted, there can be large fluctuations on the fitted parameters.

In my experience the closer to the expected curve you are the more robust your fit. Filtering and smoothing appropriately really helps.

jarvist · October 11, 2018, 10:58am

My understanding is that an averaging based smoothing procedure is equivalent to using an L1 norm in your fitting. As Paulo says, this makes it more robust to extreme values.

Generally smoothing is done to increase the signal-to-noise ratio. A Fourier transform of your data (the power spectrum) would show an enormous contribution in the high-frequency component. We know this is noise because it's unphysical (too much energy density).
If you smooth your data, you will also be able to plot the signal. Currently your plot is just showing the extreme values in whatever interval ends up as 1 pixel on the x-axis.

Seeing that your data is periodic, and looks mostly sinusoidal, you almost certainly should be using a Fourier basis for the fit.
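A least-squares fit in a truncated Fourier basis could look roughly like the sketch below. The period of 180 (the width of [-90, 90]) is an assumption about the data, `fourier_design` and `fourierfit` are hypothetical names, and the noisy mock signal is invented for the example.

```julia
using LinearAlgebra
using Random

# Design matrix with columns 1, cos(kωx), sin(kωx) for k = 1..K, ω = 2π / period.
function fourier_design(x::AbstractVector, K::Integer, period::Real)
    ω = 2π / period
    A = ones(length(x), 2K + 1)
    for k in 1:K
        A[:, 2k]     .= cos.(k .* ω .* x)
        A[:, 2k + 1] .= sin.(k .* ω .* x)
    end
    return A
end

# Least-squares fit with K harmonics; returns a function u -> fitted value.
function fourierfit(x, y, K; period = 180.0)
    c = fourier_design(x, K, period) \ y
    return u -> dot(fourier_design([u], K, period)[1, :], c)
end

# Mock data (an assumption): one harmonic plus Gaussian noise
rng = MersenneTwister(1)
x = collect(range(-90, 90; length = 1000))
y = 0.5 .+ 0.1 .* cos.(2π .* x ./ 180) .+ 0.05 .* randn(rng, length(x))

f = fourierfit(x, y, 3)
println(f(0.0))
```

With only a handful of harmonics the fit cannot follow the high-frequency noise, so it acts as a smoother and a model at the same time.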

y4lu · October 12, 2018, 9:41am

It might be worth trying to plot a random sample (maybe ~1300 or so observations) in a point-plotting or scatterplot type mode and a small marker size.
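Drawing such a random sample takes only a few lines. In this sketch `subsample` is a hypothetical helper and the data are mock values; the plotting call in the final comment assumes Plots.jl is installed.

```julia
using Random

# Pick m distinct indices at random (without replacement) and return the subsample.
function subsample(x, y, m; rng = Random.default_rng())
    idx = randperm(rng, length(x))[1:m]
    return x[idx], y[idx]
end

# Mock data standing in for the 130k points (an assumption)
x = collect(range(-90, 90; length = 130_000))
y = sind.(x)

xs, ys = subsample(x, y, 1300; rng = MersenneTwister(0))
println(length(xs))

# With Plots.jl: using Plots; scatter(xs, ys; markersize = 1, legend = false)
```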

NiclasMattsson · October 13, 2018, 8:14am

+1 for the scatter plot suggestion. It often helps to understand your data better before you go crazy on curve fitting. Could you tell us what kind of data this is? The noise looks very peculiar to me. The upper and lower limits of the noise seem far too regular compared to the variance of the noise, or maybe I should say the amplitude of the high frequency component, or whatever it is.

Andrey.Borzunov · October 18, 2018, 10:37pm

It is the dependence between tangent angle and signal intensity (acquired from an electron microscope).

Andrey.Borzunov · October 18, 2018, 10:38pm

Thx for the explanation.
