How to fit a normal approximation to data in Julia

I have an unnormalized array of counts for a probability mass function (obtained by sampling from the true distribution). The array should be read as indexed from 0: 12 is the number of times 0 occurred in the sample, 199 the number of times 1 occurred, etc.

 counts = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350, 1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0] 

I want to fit a normal approximation to this. My first attempt is to estimate its standard deviation. I normalize it with

pmf = counts/sum(counts)  

But stdm(pmf) throws an error:

 ERROR: MethodError: no method matching stdm(::Array{Float64,1})

How can I fit a normal approximation to this data set?

Three approaches that occur to me are

  1. matching moments: calculate the weighted mean and variance using StatsBase.jl (https://github.com/JuliaStats/StatsBase.jl); your counts are a weight vector

  2. the same with some quantile statistics (e.g. the 25% and 75% quantiles)

  3. maximum likelihood, or likelihood-based methods in general: given a mean and a standard deviation you can calculate the log likelihood of each value, and the counts multiply it


Thank you. I am brand new to Julia, so could you possibly include some code I can learn from, please?

E.g. the first one:

f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0]
x = 0:(length(f)-1)

using StatsBase

μ, σ = mean_and_std(x, Weights(f); corrected = false)
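
For completeness, the second approach (quantile matching) might look something like the following rough sketch, reusing f and x from above. The 25%/75% pair is an arbitrary choice, and with only 20 integer support points the empirical quantiles are coarse, so treat this as an illustration rather than a recommendation:

using Distributions

F = cumsum(f) ./ sum(f)                 # empirical CDF on the support x
q(p) = x[findfirst(c -> c ≥ p, F)]      # smallest value with CDF ≥ p

z25, z75 = quantile(Normal(), 0.25), quantile(Normal(), 0.75)
σ = (q(0.75) - q(0.25)) / (z75 - z25)   # match the interquartile range
μ = q(0.75) - σ * z75                   # then back out the location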

Or the third one (which is, frankly, probably overkill here):

f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0]
x = 0:(length(f)-1)

using DynamicHMC, TransformVariables, Distributions, UnPack, LogDensityProblems
import ForwardDiff, Random

# The model: a weighted normal log likelihood, where f are the counts and
# x the corresponding values.
struct NormalModel{TF,TX}
    f::TF
    x::TX
end

function (model::NormalModel)(θ)
    @unpack μ, σ = θ
    @unpack f, x = model
    D = Normal(μ, σ)
    sum(f .* logpdf.(D, x))  # counts multiply the pointwise log likelihoods
end

p = NormalModel(f, x)
t = as((μ = asℝ, σ = asℝ₊))  # μ unconstrained, σ > 0
ℓ = TransformedLogDensity(t, p)
∇ℓ = ADgradient(:ForwardDiff, ℓ)
results = mcmc_with_warmup(Random.GLOBAL_RNG, ∇ℓ, 1000)
posterior = t.(results.chain)  # transform the draws back to (μ, σ)
quantile(map(θ -> θ.μ, posterior), [0.25, 0.75])  # posterior quantiles of μ

Thank you so much and thank you for the third one too.

One question: what is the difference between using Statistics and using StatsBase?

They are different packages.

Ah yes, I just found StatsBase vs Statistics - #3 by nalimilan. Apparently StatsBase is due to disappear?

That may have been the intention in 2018, but I am not sure if that is still the plan. @nalimilan?


Just to add another option: the Distributions package provides a fit method that lets you fit distributions to data, in your case:

julia> f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0];

julia> fit(Normal, f)
Normal{Float64}(μ=500000.0, σ=812968.9963148657)

That’s not fitting the right thing. f is an unnormalised pmf, so the values range from 0 to 19. The mean can’t be more than 19.

Apologies, I should have read your question properly.

I meant to say:

julia> fit(Normal, [findfirst(x -> x > rand(), cumsum(wts)) for draw in 1:1_000_000])
Normal{Float64}(μ=7.966423, σ=1.2059625139507493)

Edited to add: the above is missing wts = f ./ sum(f)
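
As an aside, that one-liner has two more subtleties: the anonymous function calls rand() afresh for every element of cumsum(wts), so each comparison uses a different uniform draw, and findfirst returns a 1-based index, so the draws are shifted up by one relative to the 0-indexed values. A corrected sketch (variable names here are mine), using StatsBase’s sample to draw from the weighted support directly:

using StatsBase, Distributions

f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0]

# draw 0-indexed values with probability proportional to the counts,
# then fit a normal by maximum likelihood
draws = sample(0:length(f)-1, Weights(f), 1_000_000)
fit(Normal, draws)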


I’ve made a PR to port lots of things to Statistics, but that raised a few tricky design decisions, and then I moved on to other things.


Sorry for bumping this old thread, but @Tamas_Papp seems knowledgeable and the topic is very close to what I want to achieve:

I have a random variable X which I can, e.g., sample. I know that there is a normalizing transformation for X, namely (µ - X)^t is known to be normally distributed for some (µ, t).
What is the best/simplest/most precise way of estimating µ and t?
(Sampling X = Xₖ is relatively cheap, of cost ~ randn(k).)

What I was thinking of is MC → quantile estimation → fitting the appropriate normal CDF and deciding (µ, t) based on the MSE of this approximation, but maybe something better is at hand?
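
In code, the idea might look something like this rough sketch (normality_mse is a made-up name; the positivity guard is needed because (µ - X)^t requires µ - X > 0 for non-integer t, and Optim’s Nelder-Mead is just one way to minimize it):

using Distributions, Statistics

# MSE between the empirical quantiles of the transformed sample and the
# quantiles of the normal fitted to it; small values mean (μ, t) normalizes well
function normality_mse(θ, x; ps = 0.05:0.05:0.95)
    μ, t = θ
    all(z -> z > 0, μ .- x) || return Inf  # transform undefined otherwise
    y = (μ .- x) .^ t
    D = fit(Normal, y)
    mean(abs2, quantile(y, ps) .- quantile.(D, ps))
end

# e.g., with x holding draws of X:
# using Optim
# optimize(θ -> normality_mse(θ, x), [maximum(x) + 1.0, 1.0])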

Thanks for your kind words, but many others are knowledgeable about these topics here.

It really depends on what features of the distribution you want to match (e.g. around the mode, or the tails). Doing what you propose should be feasible, but make sure you check the relevant moments you care about (e.g. by simulating from a normal and transforming back).

The tails would be preferable, as this is one of the steps in hypothesis testing.

It is hard to say more without some context (e.g. code and formulas for the actual distribution), but this approach may be somewhat tricky to do right: tails are notoriously hard to fit, especially if they differ significantly from normal. Consequently, the result can easily be bogus.

According to what you write, X is normally distributed with unknown mean and unknown standard deviation: the true unknown mean is mu, and the standard deviation is t. Then (x - mu)/t would be a standard normal random variable (not (mu - x)/t). The maximum likelihood estimator of mu is the sample mean of the S draws, and the maximum likelihood estimator of the standard deviation is the square root of the sample variance scaled by (S - 1)/S. You can make these estimators as precise as you like by increasing S.

Now, if X is not actually normally distributed, then you would need to call on the LLN and the CLT to get to an asymptotic normal distribution. But you could still estimate mu and sigma in the same way; they just wouldn’t be ML estimators, they would be method of moments estimators.
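
For concreteness, those estimators in Julia (a sketch with simulated draws; the true values μ = 1 and σ = 2.5 are arbitrary):

using Statistics

x = 1.0 .+ 2.5 .* randn(10_000)  # S draws with true μ = 1, σ = 2.5
S = length(x)
μ̂ = mean(x)                      # ML estimator of the mean
σ̂ = sqrt(var(x) * (S - 1) / S)   # sample variance rescaled by (S-1)/S, then the root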

Actually, if you can sample the R.V., you must know mu and t, no? So, I must be misinterpreting your question.


@mcreel I wish this were so simple :wink: but it’s not (X - µ)/t that is normally distributed, but (µ - X)^t, i.e. this is some form of power transform with unknown t, which depends on k (the Xₖ form a family of random variables): I have a separate problem for each k, and my wish is to compute t = t(k).

What are the LLN and CLT you mentioned?

@Tamas_Papp the problem is that I don’t have the distribution or its formulas; this is what I want to estimate. All I have is a way to sample.


Just in case you’re interested (or for anybody else), here is an in-depth explanation:

The precise problem is the exact computation/estimation of the Shapiro-Wilk W-statistic (the one used in the normality test). Essentially, this statistic compares your (sorted) sample with the vector of expected values of order statistics (corrected for their correlation) for the distribution of choice. A generic implementation for the uncorrelated order statistics:
https://github.com/kalmarek/ShapiroWilk/blob/3d617416d5e0b8f6d84953fffe9831c74c8df1e4/src/swcoeffs.jl#L58
(Actually, computing inv(cov(OS)) was the hardest part, numerically.)

In particular, Shapiro-Wilk compares with the order statistics of the normal distribution, and the W-statistic is the correlation between the (uncorrelated) expected values and your sample:

https://github.com/kalmarek/ShapiroWilk/blob/3d617416d5e0b8f6d84953fffe9831c74c8df1e4/src/ShapiroWilk.jl#L18-32
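
As a sketch of that definition (assuming a coefficient vector a, i.e. the normalized m'V⁻¹ of the expected order statistics m and their covariance V, has already been computed, e.g. as in the repository above):

using Statistics

# W: squared correlation between the sorted sample and the expected
# order statistics, via the precomputed coefficients a
function W_statistic(x::AbstractVector, a::AbstractVector)
    y = sort(x)
    abs2(sum(a .* y)) / sum(abs2, y .- mean(y))
end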

Now, to compute p-values etc., I need to know the distribution of W = Wₖ (which was Xₖ in my first post in this thread, where k is the size of the sample).

Essentially, what I’m trying to do is bring the SW test into the 21st century :wink: