How to fit a normal approximation to data in Julia

I have an unnormalized array of counts for a probability mass function (obtained by sampling from the true distribution). The array should be read as indexed from 0: 12 is the number of times 0 occurred in the sample, 199 the number of times 1 occurred, etc.

 counts = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350, 1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0] 

I want to fit a normal approximation to this. My first attempt is to estimate its standard deviation. I normalize it with

pmf = counts/sum(counts)  

But stdm(pmf) throws an error:

 ERROR: MethodError: no method matching stdm(::Array{Float64,1})

How can I fit a normal approximation to this data set?

Three approaches that occur to me are

  1. matching moments: calculate the weighted mean and variance using StatsBase.jl (https://github.com/JuliaStats/StatsBase.jl); your counts are a weight vector

  2. the same with some quantile statistics (e.g. the 25% and 75% quantiles)

  3. maximum likelihood, or likelihood-based methods in general: given a mean and a standard deviation you can calculate the log likelihood of each value, and the counts multiply it


Thank you. I am brand new to Julia, so could you possibly include some code I can learn from, please?

E.g. the first one:

f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0]
x = 0:(length(f)-1)

using StatsBase

μ, σ = mean_and_std(x, Weights(f); corrected = false)
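
For completeness, the second approach (quantile matching) might look something like the following rough sketch, reusing f and x from above. The 25%/75% pair is an arbitrary choice, and with only 20 integer support points the empirical quantiles are coarse, so treat this as an illustration rather than a recommendation:

using Distributions

F = cumsum(f) ./ sum(f)                 # empirical CDF on the support x
q(p) = x[findfirst(c -> c ≥ p, F)]      # smallest value with CDF ≥ p

z25, z75 = quantile(Normal(), 0.25), quantile(Normal(), 0.75)
σ = (q(0.75) - q(0.25)) / (z75 - z25)   # match the interquartile range
μ = q(0.75) - σ * z75                   # then back out the location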

Or the third one (which is, frankly, probably overkill here):

f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0]
x = 0:(length(f)-1)

using DynamicHMC, TransformVariables, Distributions, UnPack, LogDensityProblems
import ForwardDiff, Random

# The model: a weighted normal log likelihood, where f are the counts and
# x the corresponding values.
struct NormalModel{TF,TX}
    f::TF
    x::TX
end

function (model::NormalModel)(θ)
    @unpack μ, σ = θ
    @unpack f, x = model
    D = Normal(μ, σ)
    sum(f .* logpdf.(D, x))  # counts multiply the pointwise log likelihoods
end

p = NormalModel(f, x)
t = as((μ = asℝ, σ = asℝ₊))  # μ unconstrained, σ > 0
ℓ = TransformedLogDensity(t, p)
∇ℓ = ADgradient(:ForwardDiff, ℓ)
results = mcmc_with_warmup(Random.GLOBAL_RNG, ∇ℓ, 1000)
posterior = t.(results.chain)  # transform the draws back to (μ, σ)
quantile(map(θ -> θ.μ, posterior), [0.25, 0.75])  # posterior quantiles of μ

Thank you so much and thank you for the third one too.

One question: what is the difference between using Statistics and using StatsBase?

They are different packages.

Ah yes, I just found StatsBase vs Statistics - #3 by nalimilan. Apparently StatsBase is due to disappear?

That may have been the intention in 2018, but I am not sure if that is still the plan. @nalimilan?


Just to add another option: the Distributions package provides a fit method that lets you fit distributions to data, in your case:

julia> f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0];

julia> fit(Normal, f)
Normal{Float64}(μ=500000.0, σ=812968.9963148657)

That’s not fitting the right thing. f is an unnormalised pmf, so the values range from 0 to 19. The mean can’t be more than 19.

Apologies, I should have read your question properly.

I meant to say:

julia> fit(Normal, [findfirst(x -> x > rand(), cumsum(wts)) for draw in 1:1_000_000])
Normal{Float64}(μ=7.966423, σ=1.2059625139507493)

Edited to add: the above is missing wts = f ./ sum(f)
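
As an aside, that one-liner has two more subtleties: the anonymous function calls rand() afresh for every element of cumsum(wts), so each comparison uses a different uniform draw, and findfirst returns a 1-based index, so the draws are shifted up by one relative to the 0-indexed values. A corrected sketch (variable names here are mine), using StatsBase’s sample to draw from the weighted support directly:

using StatsBase, Distributions

f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0]

# draw 0-indexed values with probability proportional to the counts,
# then fit a normal by maximum likelihood
draws = sample(0:length(f)-1, Weights(f), 1_000_000)
fit(Normal, draws)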


I’ve made a PR to port lots of things to Statistics, but that raised a few tricky design decisions, and then I moved on to other things.


Sorry for bumping this old thread, but @Tamas_Papp seems knowledgeable and the topic is very close to what I want to achieve:

I have a random variable X which I can, e.g., sample. I know that there is a normalizing transformation for X, namely (µ - X)^t is known to be normally distributed for some (µ, t).
What is the best/simplest/most precise way of estimating µ and t?
(Sampling X = Xₖ is relatively cheap, of cost ~ randn(k).)

What I was thinking of is MC → quantile estimation → fitting the appropriate normal CDF and deciding (µ, t) based on the MSE of this approximation, but maybe something better is at hand?
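
In code, the idea might look something like this rough sketch (normality_mse is a made-up name; the positivity guard is needed because (µ - X)^t requires µ - X > 0 for non-integer t, and Optim’s Nelder-Mead is just one way to minimize it):

using Distributions, Statistics

# MSE between the empirical quantiles of the transformed sample and the
# quantiles of the normal fitted to it; small values mean (μ, t) normalizes well
function normality_mse(θ, x; ps = 0.05:0.05:0.95)
    μ, t = θ
    all(z -> z > 0, μ .- x) || return Inf  # transform undefined otherwise
    y = (μ .- x) .^ t
    D = fit(Normal, y)
    mean(abs2, quantile(y, ps) .- quantile.(D, ps))
end

# e.g., with x holding draws of X:
# using Optim
# optimize(θ -> normality_mse(θ, x), [maximum(x) + 1.0, 1.0])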

Thanks for your kind words, but many others are knowledgeable about these topics here.

It really depends on what features of the distribution you want to match (e.g. around the mode, or the tails). Doing what you propose should be feasible, but make sure you check the relevant moments you care about (e.g. by simulating from a normal and transforming back).

The tails would be preferable, as this is one of the steps in hypothesis testing.

It is hard to say more without some context (e.g. code and formulas for the actual distribution), but this approach may be somewhat tricky to do right: tails are notoriously hard to fit, especially if they differ significantly from normal. Consequently, the result can easily be bogus.

According to what you write, X is normally distributed with unknown mean and unknown standard deviation: the true unknown mean is mu, and the standard deviation is t. Then (x - mu)/t would be a standard normal random variable (not (mu - x)/t). The maximum likelihood estimator of mu is the sample mean of the S draws, and the maximum likelihood estimator of the standard deviation is the square root of the sample variance scaled by (S - 1)/S. You can make these estimators as precise as you like by increasing S.

Now, if X is not actually normally distributed, then you would need to call on the LLN and the CLT to get to an asymptotic normal distribution. But you could still estimate mu and sigma in the same way; they just wouldn’t be ML estimators, they would be method of moments estimators.
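
For concreteness, those estimators in Julia (a sketch with simulated draws; the true values μ = 1 and σ = 2.5 are arbitrary):

using Statistics

x = 1.0 .+ 2.5 .* randn(10_000)  # S draws with true μ = 1, σ = 2.5
S = length(x)
μ̂ = mean(x)                      # ML estimator of the mean
σ̂ = sqrt(var(x) * (S - 1) / S)   # sample variance rescaled by (S-1)/S, then the root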

Actually, if you can sample the R.V., you must know mu and t, no? So, I must be misinterpreting your question.


@mcreel I wish this were so simple :wink: but it’s not (X - µ)/t that is normally distributed, but (µ - X)^t, i.e. this is some form of power transform with unknown t, which depends on k (the Xₖ form a family of random variables): I have a separate problem for each k, and my wish is to compute t = t(k).

What are the LLN and CLT you mentioned?

@Tamas_Papp the problem is that I don’t have the distribution or its formulas; this is what I want to estimate. All I have is a way to sample.


Just in case you’re interested (or for anybody else), here is an in-depth explanation:

The precise problem is the exact computation/estimation of the Shapiro-Wilk W-statistic (the one used in the normality test). Essentially, this statistic compares your (sorted) sample with the vector of expected values of order statistics (corrected for their correlation) for the distribution of choice. A generic implementation for the uncorrelated order statistics:
https://github.com/kalmarek/ShapiroWilk/blob/3d617416d5e0b8f6d84953fffe9831c74c8df1e4/src/swcoeffs.jl#L58
(Actually, computing inv(cov(OS)) was the hardest part, numerically.)

In particular, Shapiro-Wilk compares with the order statistics of the normal distribution, and the W-statistic is the correlation between the (uncorrelated) expected values and your sample:

https://github.com/kalmarek/ShapiroWilk/blob/3d617416d5e0b8f6d84953fffe9831c74c8df1e4/src/ShapiroWilk.jl#L18-32
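
As a sketch of that definition (assuming a coefficient vector a, i.e. the normalized m'V⁻¹ of the expected order statistics m and their covariance V, has already been computed, e.g. as in the repository above):

using Statistics

# W: squared correlation between the sorted sample and the expected
# order statistics, via the precomputed coefficients a
function W_statistic(x::AbstractVector, a::AbstractVector)
    y = sort(x)
    abs2(sum(a .* y)) / sum(abs2, y .- mean(y))
end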

Now, to compute p-values etc., I need to know the distribution of W = Wₖ (which was Xₖ in my first post in this thread, where k is the size of the sample).

Essentially, what I’m trying to do is bring the SW test into the 21st century :wink: