Chi-Square test of a sample

Hello!

I am working on a project in which I was given a random sample, and now I want to identify, among 3 distributions, which one created this sample.

By plotting a histogram, it seems that the gamma distribution is the one that performs the best fit, but I would also like to perform a chi-square test.

I am guessing that, maybe, I can use the ChisqTest function from HypothesisTests package, but I am not sure how to do that…

Also, the values in my random samples are all Float numbers, so I am not sure if I can use this function.

Any tips on how I can do that?

You can perform the goodness of fit chisq test which need two vectors.

As an aside, a χ^2 is a Γ distribution though not necessarily vice versa.

1 Like

Not a direct answer, but just in case, this e-handbook page explains quite clearly the steps involved when performing a Chi-Square Goodness-of-Fit Test, which should be useful before using the ChisqTest function from HypothesisTests.jl.

I typically use a non-parametric test in this case. For example:

using Distributions
using HypothesisTests
using StatsPlots

# generate some fake data (note that they come from a Gamma distribution)
data = rand(Gamma(7.5, 1.0), 75)

# fit a few distributions that I suspect may model the data well
gamma = fit(Gamma, data)
lognormal = fit(LogNormal, data)
rayleigh = fit(Rayleigh, data)

# Use quantile-quantile plots to visually inspect the goodness of fit
qqplot(data,gamma)
qqplot(data,lognormal)
qqplot(data,rayleigh)

# perform a Kolmogorov-Smirnov test (or Anderson-Darling) 
ExactOneSampleKSTest(data, gamma)
ExactOneSampleKSTest(data, lognormal)
ExactOneSampleKSTest(data, rayleigh)
6 Likes

The following Chi-Square tests seem to work in investigating which of the different input data samples follow a Gamma probability distribution.

One of the references used was this presentation by Dr. Wolfgang Rolke.

HypothesisTests_asymmetric_distributions

using Random, Distributions
using StatsBase, HypothesisTests

n = 1000     # number of data points in sample
nbins = 10   # equi-probable bins will be computed via quantiles

# Draw random samples from asymmetric distributions:
Y1 = rand(LogNormal(), n)       # lognormal distribution
Y2 = rand(Gamma(), n)           # Gamma distribution
Y3 = rand(Chi(1), n)            # Chi(1) distribution

# Test Gamma on Y1:
bins1 = quantile(Y1, LinRange(1/nbins,1,nbins))
h1 = fit(Histogram, Y1, bins1)
fd1 = fit(Gamma,Y1)
theta1 = diff(cdf.(fd1,bins1))
ChisqTest(h1.weights, theta1/sum(theta1))  # Χ² = 34  -> Reject H0

# Test Gamma on Y2:
bins2 = quantile(Y2, LinRange(1/nbins,1,nbins))
h2 = fit(Histogram, Y2, bins2)
fd2 = fit(Gamma,Y2)
theta2 = diff(cdf.(fd2,bins2))
ChisqTest(h2.weights, theta2/sum(theta2))  # Χ² = 4.7  -> Fail to reject H0

# Test Gamma on Y3:
bins3 = quantile(Y3, LinRange(1/nbins,1,nbins))
h3 = fit(Histogram, Y3, bins3)
fd3 = fit(Gamma,Y3)
theta3 = diff(cdf.(fd3,bins3))
ChisqTest(h3.weights, theta3/sum(theta3))  #  Χ² = 29.6  -> Reject H0

NB: a word of caution, produced by a non-statistician, just a Julia user

Suppose the i.i.d. sample Y is fitted with models A, B, and C.

In order to determine which model’s prediction is likely to be closer to the true (but unknown) distribution that generated the original sample Y, one of the standard prescriptions is to use the AIC. Select the model with the smallest AIC.

Ref. Akaike information criterion - Wikipedia

In the following example, models A, B, and C are respectively Gamma, Weibull, and LogNormal. I have experimented with three true distributions Gamma(5, 2), Weibull(2.4, 11.3), and LogNormal(2.2, 0.47). These are similar each others. Please look at the graph in the following example.

Results of a Monte Carlo simulation (10^4 iterations) with samples Y generated by Gamma(5, 2) of size 100:

     Gamma => 0.6692
   Weibull => 0.1409
 LogNormal => 0.1899

The percentage of the correct answer, Gamma, being selected is about 67 percent.

In the cases of Weibull(2.4, 11.3) and LogNormal(2.2, 0.47), the percentages of the correct answers are about 86 percent and about 81 percent, respectively.

For details, see the following code and results.

using Distributions, StatsPlots

function aic(model, Y)
    mle = fit_mle(model, Y)
    -2loglikelihood(mle, Y) + 2length(params(mle))
end

function simulate_model_selections(models, truedist, samplesize; niters = 10^4)
    selectedmodel = Vector{Int}(undef, niters)
    Threads.@threads for i in 1:niters
        Y = rand(truedist, samplesize)
        selectedmodel[i] = argmin(aic.(models, Ref(Y)))
    end
    nselected = zeros(Int, 3)
    for i in 1:niters
        nselected[selectedmodel[i]] += 1
    end
    [model => nselected[i]/niters for (i, model) in enumerate(models)]
end

models = (Gamma, Weibull, LogNormal)

a, b = 0, 40
plot(Gamma(5, 2), a, b; label="Gamma(5, 2)")
plot!(Weibull(2.4, 11.3), a, b; label="Weibull(2.4, 11.3)", ls=:dash)
plot!(LogNormal(2.2, 0.47), a, b; label="LogNormal(2.2, 0.47)", ls=:dashdot)

2021-07-13

simulate_model_selections(models, Gamma(5, 2), 100; niters = 10^4)
3-element Vector{Pair{UnionAll, Float64}}:
     Gamma => 0.6692
   Weibull => 0.1409
 LogNormal => 0.1899
simulate_model_selections(models, Weibull(2.4, 11.3), 100; niters = 10^4)
3-element Vector{Pair{UnionAll, Float64}}:
     Gamma => 0.1427
   Weibull => 0.856
 LogNormal => 0.0013
simulate_model_selections(models, LogNormal(2.2, 0.47), 100; niters = 10^4)
3-element Vector{Pair{UnionAll, Float64}}:
     Gamma => 0.1914
   Weibull => 0.0034
 LogNormal => 0.8052

Pluto notebook: Simulation of AIC model selection - Pluto notebook.jl

You can execute in Pluto by copy-and-pasting the URL into “Open from file:” input form and click “Open” button.

2021-07-13 (8)

Edit 1: I’m sorry. The original code works with Julia ≥ v1.7.0-beta. The edited code above works also on Julia v1.6.1.

Edit 2: Link to Pluto notebook.

simulate_model_selections(models, Gamma(5, 2), 100; niters = 10^4)
ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ .\task.jl:322 [inlined]
 [2] threading_run(func::Function)
   @ Base.Threads .\threadingconstructs.jl:34
 [3] macro expansion
   @ .\threadingconstructs.jl:93 [inlined]
 [4] simulate_model_selections(models::Tuple{UnionAll, UnionAll, UnionAll}, truedist::Gamma{Float64}, samplesize::Int64; niters::Int64)
   @ Main .\REPL[50]:3
 [5] top-level scope
   @ REPL[51]:1

    nested task error: MethodError: no method matching keys(::Base.Generator{Tuple{UnionAll, UnionAll, UnionAll}, var"#12#15"{Vector{Float64}}})
    Closest candidates are:
      keys(::Union{Tables.AbstractColumns, Tables.AbstractRow}) at C:\Users\Hermesr\.julia\packages\Tables\gg6Id\src\Tables.jl:181
      keys(::Missings.SkipMissings{V, T} where {V<:AbstractArray, T<:Tuple{Vararg{AbstractArray, N} where N}})
at C:\Users\Hermesr\.julia\packages\Missings\sx5js\src\Missings.jl:354
      keys(::Missings.EachReplaceMissing) at C:\Users\Hermesr\.julia\packages\Missings\sx5js\src\Missings.jl:94      ...
    Stacktrace:
     [1] pairs(collection::Base.Generator{Tuple{UnionAll, UnionAll, UnionAll}, var"#12#15"{Vector{Float64}}})
       @ Base .\abstractdict.jl:138
     [2] _findmin(a::Base.Generator{Tuple{UnionAll, UnionAll, UnionAll}, var"#12#15"{Vector{Float64}}}, #unused#::Colon)
       @ Base .\array.jl:2284
     [3] findmin(a::Base.Generator{Tuple{UnionAll, UnionAll, UnionAll}, var"#12#15"{Vector{Float64}}})
       @ Base .\array.jl:2281
     [4] argmin(a::Base.Generator{Tuple{UnionAll, UnionAll, UnionAll}, var"#12#15"{Vector{Float64}}})
       @ Base .\array.jl:2346
     [5] macro expansion
       @ .\REPL[50]:5 [inlined]
     [6] (::var"#25#threadsfor_fun#14"{Tuple{UnionAll, UnionAll, UnionAll}, Gamma{Float64}, Int64, Vector{Int64}, UnitRange{Int64}})(onethread::Bool)
       @ Main .\threadingconstructs.jl:81
     [7] (::var"#25#threadsfor_fun#14"{Tuple{UnionAll, UnionAll, UnionAll}, Gamma{Float64}, Int64, Vector{Int64}, UnitRange{Int64}})()
       @ Main .\threadingconstructs.jl:48

Oh, I’m sorry.

I have corrected the code above to work on Julia v1.6.1.

I usually use the nightly build because I want to enjoy the latest features implemented by the developers. Though the nightly build is sometimes pretty much broken, I enjoy it as proof that the developers are taking on a challenging task.

simulate_model_selections(models, Gamma(5, 2), 100; niters = 10^4)
ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ .\task.jl:322 [inlined]
 [2] threading_run(func::Function)
   @ Base.Threads .\threadingconstructs.jl:34
 [3] macro expansion
   @ .\threadingconstructs.jl:93 [inlined]
 [4] simulate_model_selections(models::Tuple{UnionAll, UnionAll, UnionAll}, truedist::Gamma{Float64}, samplesize::Int64; niters::Int64)
   @ Main .\REPL[11]:3
 [5] top-level scope
   @ REPL[12]:1

    nested task error: UndefVarError: mle not defined
    Stacktrace:
      [1] aic(model::Type, Y::Vector{Float64})
        @ Main .\REPL[10]:2
      [2] _broadcast_getindex_evalf
        @ .\broadcast.jl:648 [inlined]
      [3] _broadcast_getindex
        @ .\broadcast.jl:621 [inlined]
      [4] (::Base.Broadcast.var"#19#20"{Base.Broadcast.Broadcasted{Base.Broadcast.Style{Tuple}, Nothing, typeof(aic), Tuple{Tuple{UnionAll, UnionAll, UnionAll}, Base.RefValue{Vector{Float64}}}}})(k::Int64)
        @ Base.Broadcast .\broadcast.jl:1098
      [5] ntuple
        @ .\ntuple.jl:50 [inlined]
      [6] copy
        @ .\broadcast.jl:1098 [inlined]
      [7] materialize
        @ .\broadcast.jl:883 [inlined]
      [8] macro expansion
        @ .\REPL[11]:5 [inlined]
      [9] (::var"#25#threadsfor_fun#11"{Tuple{UnionAll, UnionAll, UnionAll}, Gamma{Float64}, Int64, Vector{Int64}, UnitRange{Int64}})(onethread::Bool)
        @ Main .\threadingconstructs.jl:81
     [10] (::var"#25#threadsfor_fun#11"{Tuple{UnionAll, UnionAll, UnionAll}, Gamma{Float64}, Int64, Vector{Int64}, UnitRange{Int64}})()
        @ Main .\threadingconstructs.jl:48

Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, haswell)

The following is how it works fine in my environment.

Pluto notebook version:

2 Likes

Why can’t I update the package to the desired version?

>st
      Status `C:\Users\Hermesr\.julia\environments\v1.6\Project.toml`
  [31c24e10] Distributions v0.24.18
>up Distributions@0.25.10
    Updating registry at `C:\Users\Hermesr\.julia\registries\General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
  No Changes to `C:\Users\Hermesr\.julia\environments\v1.6\Project.toml`
  No Changes to `C:\Users\Hermesr\.julia\environments\v1.6\Manifest.toml`

>st Distributions
      Status `C:\Users\Hermesr\.julia\environments\v1.6\Project.toml`
  [31c24e10] Distributions v0.24.18

I think you mean add Distributions@0.25.10?

(@v1.6) pkg> rm Distributions
    Updating `C:\Users\Hermesr\.julia\environments\v1.6\Project.toml`
  [31c24e10] - Distributions v0.24.18
  No Changes to `C:\Users\Hermesr\.julia\environments\v1.6\Manifest.toml`

(@v1.6) pkg> add Distributions@0.25.10
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Distributions [31c24e10]:
 Distributions [31c24e10] log:
 ├─possible versions are: 0.16.0-0.25.10 or uninstalled
 ├─restricted to versions 0.25.10 by an explicit requirement, leaving only versions 0.25.10
 └─restricted by compatibility requirements with Gadfly [c91e804a] to versions: 0.16.0-0.24.18 — no versions left
   └─Gadfly [c91e804a] log:
     ├─possible versions are: 0.8.0-1.3.3 or uninstalled
     └─restricted to versions * by an explicit requirement, leaving only versions 0.8.0-1.3.3

(@v1.6) pkg> add Distributions
   Resolving package versions...
    Updating `C:\Users\Hermesr\.julia\environments\v1.6\Project.toml`
  [31c24e10] + Distributions v0.24.18
  No Changes to `C:\Users\Hermesr\.julia\environments\v1.6\Manifest.toml`

So there’s your answer!

(NB you don’t have to rm if you want to install a specific version, add Package@version essentially does what you thought up Package@version would do)

My intention was to upgrade the Distribution package to version 0.25.10.
Think that, with up Distributions@0.25.10. that goal was achieved.
Then I do add Distributions@0.25.10. And I present the following alert. I don’t know how to solve it!:

add Distributions@0.25.10
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Distributions [31c24e10]:
 Distributions [31c24e10] log:
 ├─possible versions are: 0.16.0-0.25.10 or uninstalled
 ├─restricted to versions 0.25.10 by an explicit requirement, leaving only versions 0.25.10
 └─restricted by compatibility requirements with Gadfly [c91e804a] to versions: 0.16.0-0.24.18 — no versions left
   └─Gadfly [c91e804a] log:
     ├─possible versions are: 0.8.0-1.3.3 or uninstalled
     └─restricted to versions * by an explicit requirement, leaving only versions 0.8.0-1.3.3

I understood that, and the error message tells you pretty explicitly what’s going on - Gadfly is only compatible with Distributions up to 0.24, so you cannot install Gadfly and Distributions@0.25 in the same environment. I feel like I’ve given you this advice before, but to reiterate you generally shouldn’t have all your packages in the default (@v1.6) environment, but rather work in project specific environments containing only the packages you actually need for whatever you’re currently trying to do.

2 Likes
julia> simulate_model_selections(models, Gamma(5, 2), 100; niters = 10^4)
3-element Vector{Pair{UnionAll, Float64}}:
     Gamma => 0.6696
   Weibull => 0.1463
 LogNormal => 0.1841

julia> simulate_model_selections(models, Weibull(2.4, 11.3), 100; niters = 10^4)
3-element Vector{Pair{UnionAll, Float64}}:
     Gamma => 0.144
   Weibull => 0.8538
 LogNormal => 0.0022

julia> simulate_model_selections(models, LogNormal(2.2, 0.47), 100; niters = 10^4)
3-element Vector{Pair{UnionAll, Float64}}:
     Gamma => 0.1836
   Weibull => 0.0035
 LogNormal => 0.8129

(Julia Statistics) pkg> st
      Status `C:\projects\Julia Statistics\Project.toml`
  [31c24e10] Distributions v0.25.10
  [91a5bcdd] Plots v1.19.0
  [f3b207a7] StatsPlots v0.14.25

(Julia Statistics) pkg> 

I assume you’re saying it’s working now, in which case glad to see that it is!

Thanks everybody for the great answers!!

1 Like