# How to fit a normal distribution to some points in the tail

I have some empirical values for the cdf of a distribution which should be close to Normal. These are for x = 1, 2, … , 13. They are as follows:

[5.2909066017292616e-17, 2.998035847356917e-15, 2.893916126231429e-12, 2.893916126231429e-12, 5.984024927027287e-11, 9.828060393355409e-10, 1.3065783612939083e-8, 1.4263214765731293e-7, 1.2913225520247604e-6, 9.765050285646565e-6, 6.197926915196307e-5, 0.00033122740938565793, 0.001493228789321792]

I would like to find a normal distribution that fits these as well as possible. To do this I first defined a cost function and then tried to use optimize. I suspect I am doing it wrong however.

using Distributions
using Optim
empiricalcdf = [5.2909066017292616e-17, 2.998035847356917e-15, 2.893916126231429e-12, 2.893916126231429e-12, 5.984024927027287e-11, 9.828060393355409e-10, 1.3065783612939083e-8, 1.4263214765731293e-7, 1.2913225520247604e-6, 9.765050285646565e-6, 6.197926915196307e-5, 0.00033122740938565793, 0.001493228789321792]
function cdfcost(x)
return sum((empiricalcdf .- cdf.(Normal(x, x), collect(1:13))).^2)
end
print(optimize(cdfcost, [10.0 ,2.0], BFGS()))


This gives me:

 * Status: success

* Candidate solution
Minimizer: [1.00e+01, 2.00e+00]
Minimum:   2.428780e+00

* Found with
Algorithm:     BFGS
Initial Point: [1.00e+01, 2.00e+00]

* Convergence measures
|x - x'|               = 0.00e+00 ≤ 0.0e+00
|x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
|g(x)|                 = 9.24e-01 ≰ 1.0e-08

* Work counters
Seconds run:   0  (vs limit Inf)
Iterations:    1
f(x) calls:    52
∇f(x) calls:   52


This seems to mean no optimization has occurred. What am I doing wrong and is there a simpler way to achieve the same goal?

This seems to work:

μ_hat, σ_hat = optimize(cdfcost, [-100.0, 0.01], [100.0, 10.0], [0.0, 1.0]).minimizer

plot(0.1:0.1:13, [cdf(Normal(μ_hat, σ_hat), x) for x in 0.1:0.1:13])
scatter!(1:13, empiricalcdf)

1 Like

Is the only difference that you have added a box constraint?

Well it also uses a different algorithm and starting point - it looks like this problem is not well behaved? Consider (just picked the first few examples from the docs):

μ_hat, σ_hat = optimize(cdfcost, [-100.0, 0.01], [100.0, 10.0], [0.0, 1.0]).minimizer
μ_hat2, σ_hat2 = optimize(cdfcost, [0.0, 1.0], BFGS()).minimizer
μ_hat3, σ_hat3 = optimize(cdfcost, [1.0, 0.1], LBFGS(); autodiff = :forward).minimizer
μ_hat4, σ_hat4 = optimize(cdfcost, [1.0, 0.1]).minimizer

plot(0.1:0.1:14, [cdf(Normal(μ_hat, σ_hat), x) for x in 0.1:0.1:14])
plot!(0.1:0.1:14, [cdf(Normal(μ_hat2, σ_hat2), x) for x in 0.1:0.1:14])
plot!(0.1:0.1:14, [cdf(Normal(μ_hat3, σ_hat3), x) for x in 0.1:0.1:14])
plot!(0.1:0.1:14, [cdf(Normal(μ_hat4, σ_hat4), x) for x in 0.1:0.1:14])
scatter!(1:13, empiricalcdf)


Maybe @pkofod can shed some light on what’s going on

2 Likes

All that we are doing is varying the mean and variance of a normal. It feels like the optimization should be well behaved.

Not really. The small probabilities don’t identify anything strongly (0 is a very good fit, producing a small distance metric), so you get iso-cost lines.

When in doubt, plot. Eg

using Distributions, PGFPlotsX
empiricalcdf = [5.2909066017292616e-17, 2.998035847356917e-15, 2.893916126231429e-12,
2.893916126231429e-12, 5.984024927027287e-11, 9.828060393355409e-10,
1.3065783612939083e-8, 1.4263214765731293e-7, 1.2913225520247604e-6,
9.765050285646565e-6, 6.197926915196307e-5, 0.00033122740938565793,
0.001493228789321792]
cdfcost(μ, σ) = sum(abs2, empiricalcdf .- cdf.(Normal(μ, σ), collect(1:13)))

μ = range(-5, 5, length = 50)
σ = range(0.1, 2, length = 50)
@pgf Axis({ view = (0, 90), colorbar, "colormap/jet",
xlabel = raw"$\mu$", ylabel = raw"$\sigma$"},
Plot3({ surf, shader = "flat" },
Table(μ, σ, cdfcost.(μ, σ'))))


Try a likelihood-based method.

5 Likes

FWIW, the likelihood I wrote for what I imagine is the same problem should work for the truncated data.

I’m going to just second what @Tamas_Papp has already said. The data are too far into the tail to really be informative. Based on your other-other post, you using a skew normal, that is actually only going to make the problem worse because it is actually a more flexible distribution. That means that tail outcomes are even less informative about the rest of the shape of the distribution. Regardless of the parameters of the distribution, all of the CDFs must approach zero.

This isn’t to say that you CAN’T do what you are doing. Just an explanation for why it isn’t as straightforward as you thought it might be.

EDIT: This is clear from looking at the plots on the wiki page as well. https://en.wikipedia.org/wiki/Skew_normal_distribution

1 Like

To be fair, I didn’t really think it would be straightforward. But I was hoping that with the very high precision of BigFloat optimization might work. I have in mind that 3 points (even from the tail if one has enough precision) is enough to uniquely define a normal distribution. Perhaps 4 is sufficient to uniquely define a skew normal (I don’t have a proof for that).

Thank you so much for all the help you are giving. I hope it’s a fun problem!

I think your conjectures would be correct if indeed the points came from such distributions. But with noisy data, anything goes.

2 Likes

It’s in fact worse than that. My data is in fact not noisy (I have precise rational values for the densities which I convert to BigFloat). It just doesn’t come from precisely a normal distribution. It will be close to a normal and even closer to a skew normal I suspect.

You could try some extreme value techniques perhaps. Fit some tail distribution to the data, then find a full distribution consistent with those estimates. Just a weird thought.

It looks like you are trying to match arbitrary “moments” (in the broad sense) of one distribution to another by solving a nonlinear optimization problem.

Numerical problems aside, I am not sure this is a meaningful exercise: as @tbeason pointed out, the results will be somewhat arbitrary, and very sensitive to minor variations in values (even just floating point noise in the tails) and the moments you use.

Note that the method you are using here does not allow you to evaluate how good the fit is in any way that is statistically meaningful.

Also, if you have precise values from a distribution, then the best fitting distribution is (drumroll) itself. There is nothing special about “named” distribution families, especially skew normal which is not standard at all (there are many different families called “skew normal”, depending on the transformation you use for skewing).

It looks like you have a conceptual, not a computational problem. This may be a good point to step back a bit and reflect on the purpose of the exercise.

Also, if you have precise values from a distribution, then the best fitting distribution is (drumroll) itself.

I have precise values only for x = 1…13. I would like to infer the rest of the distribution and I have a strong belief in what sort of distribution it is.

Thanks for all your help so far. Your MCMC code is particularly intriguing. Although I understand MCMC mathematically I haven’t managed to understand the code yet!

Just to add my two cents: I had often normal distributed data to fit. If you do a probability plot or qq-plot you see that in most cases the tails do not fit. It is extremely rare to have normal distributed data even with very good fits to have fitting tails.

Therefore I wouldn’t even try to fit this. Only if you have some additional information or assumptions eg. expected mean …, you can do something with Bayesian methods …

2 Likes

That’s a very good point. A minor related question, how do you specify convergence criteria when you have box constraints?

Maybe this is more for others who land on this thread in pursuit of an answer to their own question, Distributions.jl has their own distribution fitting.

1 Like

I don’t think that works for this particular case — AFAICT none of the fit_mle methods support truncated normals.

Also, I think it is important to emphasize that this exercise has a conceptual problem, which should be tackled before the solution is attempted. Coding these methods in Julia is relatively easy after that.

I may have the terminology wrong but I don’t believe I have a truncated normal. I have simply been told the densities of a normal at 13 points in the tail and I have to work out what the full distribution is (i.e. the mean and variance). I should say the optimization approach works ok but I just have to do a grid search to find the minimum error where I would have expected Optim to be able to do something cleverer.