How to fit a normal approximation to data in Julia

I have an unnormalized array of counts for a probability mass function (derived by sampling from the true distribution). The array is conceptually indexed from 0: 12 is the number of times a 0 occurred in the sample, 199 the number of times a 1 occurred, etc.

 counts = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350, 1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0] 

I want to fit a normal approximation to this. My first attempt is to estimate its standard deviation. I normalize it with

pmf = counts/sum(counts)  

But stdm(pmf) fails with an error:

 ERROR: MethodError: no method matching stdm(::Array{Float64,1})

How can I fit a normal approximation to this data set?

Three approaches that occur to me are

  1. matching moments: calculate weighted mean and variance using https://github.com/JuliaStats/StatsBase.jl (your counts are a weight vector)

  2. the same with some quantile statistics (e.g. the 25th and 75th percentiles)

  3. maximum likelihood, or likelihood-based methods in general: given a mean and a standard deviation you can calculate the log likelihood of each value, weighted by its count
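A minimal sketch of the second approach, assuming StatsBase's weighted quantile method: the weighted median estimates μ, and the interquartile range divided by 2·Φ⁻¹(0.75) ≈ 1.349 estimates σ.

```julia
using StatsBase, Distributions

f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0]
x = collect(0.0:length(f) - 1)

# weighted quantiles of the support, using the counts as weights
q25, q50, q75 = quantile(x, Weights(f), [0.25, 0.5, 0.75])

μ̂ = q50                                           # median estimates the mean
σ̂ = (q75 - q25) / (2 * quantile(Normal(), 0.75))  # IQR / 1.349 estimates σ
```

This is cruder than moment matching on a skewed sample (the median here lands somewhat below the weighted mean), but it is robust to outliers in the tails.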

Thank you. I am brand new to Julia so could you possibly include some code I can learn from please?

E.g. the first one:

f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0]
x = 0:(length(f)-1)

using StatsBase

μ, σ = mean_and_std(x, Weights(f); corrected = false)
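For comparison, the same weighted moments can be computed by hand in base Julia; with corrected = false this should match what mean_and_std returns:

```julia
f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0]
x = 0:(length(f) - 1)

w = f ./ sum(f)                      # normalize counts to a pmf
μ = sum(w .* x)                      # weighted mean
σ = sqrt(sum(w .* (x .- μ) .^ 2))    # weighted (uncorrected) standard deviation
```

For these counts this gives μ ≈ 7.55 and σ ≈ 1.56.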

Or the third one (which is, frankly, probably overkill here):

f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0]
x = 0:(length(f)-1)

using DynamicHMC, TransformVariables, Distributions, UnPack, LogDensityProblems
import ForwardDiff, Random

struct NormalModel{TF,TX}
    f::TF
    x::TX
end

function (model::NormalModel)(θ)
    @unpack μ, σ = θ
    @unpack f, x = model
    D = Normal(μ, σ)
    sum(f .* logpdf.(D, x))
end

p = NormalModel(f, x)
t = as((μ = asℝ, σ = asℝ₊))
ℓ = TransformedLogDensity(t, p)
∇ℓ = ADgradient(:ForwardDiff, ℓ)
results = mcmc_with_warmup(Random.GLOBAL_RNG, ∇ℓ, 1000)
posterior = t.(results.chain)
quantile(map(p -> p.μ, posterior), [0.25, 0.75]) # μ

Thank you so much and thank you for the third one too.

One question: what is the difference between using Statistics and using StatsBase?

They are different packages.

Ah yes, I just found StatsBase vs Statistics. Apparently StatsBase is due to disappear?

That may have been the intention in 2018, but I am not sure if that is still the plan. @nalimilan?


Just to add another option: the Distributions package provides a fit method that lets you fit distributions to data, in your case:

julia> f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0];

julia> fit(Normal, f)
Normal{Float64}(μ=500000.0, σ=812968.9963148657)

That’s not fitting the right thing. f holds unnormalised counts for the values 0 to 19, so fit(Normal, f) fits the counts themselves. The mean can’t be more than 19.

Apologies, should have read your question properly

I meant to say:

julia> wts = f ./ sum(f);

julia> fit(Normal, [findfirst(x -> x > rand(), cumsum(wts)) for draw in 1:1_000_000])
Normal{Float64}(μ=7.966423, σ=1.2059625139507493)
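If resampling is the goal, StatsBase's sample with a weight vector may be a cleaner route. Note also that findfirst returns 1-based indices, so the draws above are shifted by one relative to the 0-indexed support:

```julia
using Distributions, StatsBase

f = [12, 199, 3125, 29616, 175426, 662401, 1608671, 2477850, 2441350,
     1580290, 709625, 235426, 60899, 12643, 2126, 296, 41, 4, 0, 0]
x = 0:(length(f) - 1)

# draw values from the support with probability proportional to the counts
draws = sample(x, Weights(f), 1_000_000)
d = fit(Normal, draws)
```

With a million draws the fitted parameters should land close to the exact weighted moments, μ ≈ 7.55 and σ ≈ 1.56.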


I’ve made a PR to port lots of things to Statistics, but that raised a few tricky design decisions, and then I moved on to other things.
