# Chi-Square test of a sample

Hello!

I am working on a project in which I was given a random sample, and now I want to identify, among 3 distributions, which one created this sample.

By plotting a histogram, it seems that the gamma distribution is the one that performs the best fit, but I would also like to perform a chi-square test.

I am guessing that, maybe, I can use the `ChisqTest` function from `HypothesisTests` package, but I am not sure how to do that…

Also, the values in my random samples are all Float numbers, so I am not sure if I can use this function.

Any tips on how I can do that?

You can perform the goodness of fit chisq test which need two vectors.

As an aside, a χ^2 is a Γ distribution though not necessarily vice versa.

1 Like

Not a direct answer, but just in case, this e-handbook page explains quite clearly the steps involved when performing a Chi-Square Goodness-of-Fit Test, which should be useful before using the `ChisqTest` function from `HypothesisTests.jl`.

I typically use a non-parametric test in this case. For example:

``````using Distributions
using HypothesisTests
using StatsPlots

# generate some fake data (note that they come from a Gamma distribution)
data = rand(Gamma(7.5, 1.0), 75)

# fit a few distributions that I suspect may model the data well
gamma = fit(Gamma, data)
lognormal = fit(LogNormal, data)
rayleigh = fit(Rayleigh, data)

# Use quantile-quantile plots to visually inspect the goodness of fit
qqplot(data,gamma)
qqplot(data,lognormal)
qqplot(data,rayleigh)

# perform a Kolmogorov-Smirnov test (or Anderson-Darling)
ExactOneSampleKSTest(data, gamma)
ExactOneSampleKSTest(data, lognormal)
ExactOneSampleKSTest(data, rayleigh)
``````
6 Likes

The following Chi-Square tests seem to work in investigating which of the different input data samples follow a Gamma probability distribution.

One of the references used was this presentation by Dr. Wolfgang Rolke.

``````using Random, Distributions
using StatsBase, HypothesisTests

n = 1000     # number of data points in sample
nbins = 10   # equi-probable bins will be computed via quantiles

# Draw random samples from asymmetric distributions:
Y1 = rand(LogNormal(), n)       # lognormal distribution
Y2 = rand(Gamma(), n)           # Gamma distribution
Y3 = rand(Chi(1), n)            # Chi(1) distribution

# Test Gamma on Y1:
bins1 = quantile(Y1, LinRange(1/nbins,1,nbins))
h1 = fit(Histogram, Y1, bins1)
fd1 = fit(Gamma,Y1)
theta1 = diff(cdf.(fd1,bins1))
ChisqTest(h1.weights, theta1/sum(theta1))  # Χ² = 34  -> Reject H0

# Test Gamma on Y2:
bins2 = quantile(Y2, LinRange(1/nbins,1,nbins))
h2 = fit(Histogram, Y2, bins2)
fd2 = fit(Gamma,Y2)
theta2 = diff(cdf.(fd2,bins2))
ChisqTest(h2.weights, theta2/sum(theta2))  # Χ² = 4.7  -> Fail to reject H0

# Test Gamma on Y3:
bins3 = quantile(Y3, LinRange(1/nbins,1,nbins))
h3 = fit(Histogram, Y3, bins3)
fd3 = fit(Gamma,Y3)
theta3 = diff(cdf.(fd3,bins3))
ChisqTest(h3.weights, theta3/sum(theta3))  #  Χ² = 29.6  -> Reject H0
``````

NB: a word of caution, produced by a non-statistician, just a Julia user

Suppose the i.i.d. sample `Y` is fitted with models A, B, and C.

In order to determine which model’s prediction is likely to be closer to the true (but unknown) distribution that generated the original sample `Y`, one of the standard prescriptions is to use the AIC. Select the model with the smallest AIC.

In the following example, models A, B, and C are respectively `Gamma`, `Weibull`, and `LogNormal`. I have experimented with three true distributions `Gamma(5, 2)`, `Weibull(2.4, 11.3)`, and `LogNormal(2.2, 0.47)`. These are similar each others. Please look at the graph in the following example.

Results of a Monte Carlo simulation (10^4 iterations) with samples `Y` generated by `Gamma(5, 2)` of size 100:

``````     Gamma => 0.6692
Weibull => 0.1409
LogNormal => 0.1899
``````

The percentage of the correct answer, `Gamma`, being selected is about 67 percent.

In the cases of `Weibull(2.4, 11.3)` and `LogNormal(2.2, 0.47)`, the percentages of the correct answers are about 86 percent and about 81 percent, respectively.

For details, see the following code and results.

``````using Distributions, StatsPlots

function aic(model, Y)
mle = fit_mle(model, Y)
-2loglikelihood(mle, Y) + 2length(params(mle))
end

function simulate_model_selections(models, truedist, samplesize; niters = 10^4)
selectedmodel = Vector{Int}(undef, niters)
Y = rand(truedist, samplesize)
selectedmodel[i] = argmin(aic.(models, Ref(Y)))
end
nselected = zeros(Int, 3)
for i in 1:niters
nselected[selectedmodel[i]] += 1
end
[model => nselected[i]/niters for (i, model) in enumerate(models)]
end

models = (Gamma, Weibull, LogNormal)

a, b = 0, 40
plot(Gamma(5, 2), a, b; label="Gamma(5, 2)")
plot!(Weibull(2.4, 11.3), a, b; label="Weibull(2.4, 11.3)", ls=:dash)
plot!(LogNormal(2.2, 0.47), a, b; label="LogNormal(2.2, 0.47)", ls=:dashdot)
``````

``````simulate_model_selections(models, Gamma(5, 2), 100; niters = 10^4)
``````
``````3-element Vector{Pair{UnionAll, Float64}}:
Gamma => 0.6692
Weibull => 0.1409
LogNormal => 0.1899
``````
``````simulate_model_selections(models, Weibull(2.4, 11.3), 100; niters = 10^4)
``````
``````3-element Vector{Pair{UnionAll, Float64}}:
Gamma => 0.1427
Weibull => 0.856
LogNormal => 0.0013
``````
``````simulate_model_selections(models, LogNormal(2.2, 0.47), 100; niters = 10^4)
``````
``````3-element Vector{Pair{UnionAll, Float64}}:
Gamma => 0.1914
Weibull => 0.0034
LogNormal => 0.8052
``````

Pluto notebook: Simulation of AIC model selection - Pluto notebook.jl

You can execute in Pluto by copy-and-pasting the URL into “Open from file:” input form and click “Open” button.

Edit 1: I’m sorry. The original code works with Julia ≥ v1.7.0-beta. The edited code above works also on Julia v1.6.1.

Edit 2: Link to Pluto notebook.

``````simulate_model_selections(models, Gamma(5, 2), 100; niters = 10^4)
Stacktrace:
[1] wait
[3] macro expansion
[4] simulate_model_selections(models::Tuple{UnionAll, UnionAll, UnionAll}, truedist::Gamma{Float64}, samplesize::Int64; niters::Int64)
@ Main .\REPL[50]:3
[5] top-level scope
@ REPL[51]:1

nested task error: MethodError: no method matching keys(::Base.Generator{Tuple{UnionAll, UnionAll, UnionAll}, var"#12#15"{Vector{Float64}}})
Closest candidates are:
keys(::Union{Tables.AbstractColumns, Tables.AbstractRow}) at C:\Users\Hermesr\.julia\packages\Tables\gg6Id\src\Tables.jl:181
keys(::Missings.SkipMissings{V, T} where {V<:AbstractArray, T<:Tuple{Vararg{AbstractArray, N} where N}})
at C:\Users\Hermesr\.julia\packages\Missings\sx5js\src\Missings.jl:354
keys(::Missings.EachReplaceMissing) at C:\Users\Hermesr\.julia\packages\Missings\sx5js\src\Missings.jl:94      ...
Stacktrace:
[1] pairs(collection::Base.Generator{Tuple{UnionAll, UnionAll, UnionAll}, var"#12#15"{Vector{Float64}}})
@ Base .\abstractdict.jl:138
[2] _findmin(a::Base.Generator{Tuple{UnionAll, UnionAll, UnionAll}, var"#12#15"{Vector{Float64}}}, #unused#::Colon)
@ Base .\array.jl:2284
[3] findmin(a::Base.Generator{Tuple{UnionAll, UnionAll, UnionAll}, var"#12#15"{Vector{Float64}}})
@ Base .\array.jl:2281
[4] argmin(a::Base.Generator{Tuple{UnionAll, UnionAll, UnionAll}, var"#12#15"{Vector{Float64}}})
@ Base .\array.jl:2346
[5] macro expansion
@ .\REPL[50]:5 [inlined]
[7] (::var"#25#threadsfor_fun#14"{Tuple{UnionAll, UnionAll, UnionAll}, Gamma{Float64}, Int64, Vector{Int64}, UnitRange{Int64}})()
``````

Oh, I’m sorry.

I have corrected the code above to work on Julia v1.6.1.

I usually use the nightly build because I want to enjoy the latest features implemented by the developers. Though the nightly build is sometimes pretty much broken, I enjoy it as proof that the developers are taking on a challenging task.

``````simulate_model_selections(models, Gamma(5, 2), 100; niters = 10^4)
Stacktrace:
[1] wait
[3] macro expansion
[4] simulate_model_selections(models::Tuple{UnionAll, UnionAll, UnionAll}, truedist::Gamma{Float64}, samplesize::Int64; niters::Int64)
@ Main .\REPL[11]:3
[5] top-level scope
@ REPL[12]:1

nested task error: UndefVarError: mle not defined
Stacktrace:
[1] aic(model::Type, Y::Vector{Float64})
@ Main .\REPL[10]:2
[5] ntuple
@ .\ntuple.jl:50 [inlined]
[6] copy
[7] materialize
[8] macro expansion
@ .\REPL[11]:5 [inlined]
[10] (::var"#25#threadsfor_fun#11"{Tuple{UnionAll, UnionAll, UnionAll}, Gamma{Float64}, Int64, Vector{Int64}, UnitRange{Int64}})()

Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, haswell)
``````

The following is how it works fine in my environment.

Pluto notebook version:

2 Likes

Why can’t I update the package to the desired version?

``````>st
Status `C:\Users\Hermesr\.julia\environments\v1.6\Project.toml`
[31c24e10] Distributions v0.24.18
>up Distributions@0.25.10
Updating registry at `C:\Users\Hermesr\.julia\registries\General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
No Changes to `C:\Users\Hermesr\.julia\environments\v1.6\Project.toml`
No Changes to `C:\Users\Hermesr\.julia\environments\v1.6\Manifest.toml`

>st Distributions
Status `C:\Users\Hermesr\.julia\environments\v1.6\Project.toml`
[31c24e10] Distributions v0.24.18
``````

I think you mean `add Distributions@0.25.10`?

``````(@v1.6) pkg> rm Distributions
Updating `C:\Users\Hermesr\.julia\environments\v1.6\Project.toml`
[31c24e10] - Distributions v0.24.18
No Changes to `C:\Users\Hermesr\.julia\environments\v1.6\Manifest.toml`

Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Distributions [31c24e10]:
Distributions [31c24e10] log:
├─possible versions are: 0.16.0-0.25.10 or uninstalled
├─restricted to versions 0.25.10 by an explicit requirement, leaving only versions 0.25.10
└─restricted by compatibility requirements with Gadfly [c91e804a] to versions: 0.16.0-0.24.18 — no versions left
├─possible versions are: 0.8.0-1.3.3 or uninstalled
└─restricted to versions * by an explicit requirement, leaving only versions 0.8.0-1.3.3

Resolving package versions...
Updating `C:\Users\Hermesr\.julia\environments\v1.6\Project.toml`
[31c24e10] + Distributions v0.24.18
No Changes to `C:\Users\Hermesr\.julia\environments\v1.6\Manifest.toml`
``````

(NB you don’t have to `rm` if you want to install a specific version, `add Package@version` essentially does what you thought `up Package@version` would do)

My intention was to upgrade the Distribution package to version 0.25.10.
Think that, with up Distributions@0.25.10. that goal was achieved.
Then I do add Distributions@0.25.10. And I present the following alert. I don’t know how to solve it!:

``````add Distributions@0.25.10
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Distributions [31c24e10]:
Distributions [31c24e10] log:
├─possible versions are: 0.16.0-0.25.10 or uninstalled
├─restricted to versions 0.25.10 by an explicit requirement, leaving only versions 0.25.10
└─restricted by compatibility requirements with Gadfly [c91e804a] to versions: 0.16.0-0.24.18 — no versions left
├─possible versions are: 0.8.0-1.3.3 or uninstalled
└─restricted to versions * by an explicit requirement, leaving only versions 0.8.0-1.3.3
``````

I understood that, and the error message tells you pretty explicitly what’s going on - Gadfly is only compatible with Distributions up to 0.24, so you cannot install Gadfly and Distributions@0.25 in the same environment. I feel like I’ve given you this advice before, but to reiterate you generally shouldn’t have all your packages in the default (`@v1.6`) environment, but rather work in project specific environments containing only the packages you actually need for whatever you’re currently trying to do.

2 Likes
``````julia> simulate_model_selections(models, Gamma(5, 2), 100; niters = 10^4)
3-element Vector{Pair{UnionAll, Float64}}:
Gamma => 0.6696
Weibull => 0.1463
LogNormal => 0.1841

julia> simulate_model_selections(models, Weibull(2.4, 11.3), 100; niters = 10^4)
3-element Vector{Pair{UnionAll, Float64}}:
Gamma => 0.144
Weibull => 0.8538
LogNormal => 0.0022

julia> simulate_model_selections(models, LogNormal(2.2, 0.47), 100; niters = 10^4)
3-element Vector{Pair{UnionAll, Float64}}:
Gamma => 0.1836
Weibull => 0.0035
LogNormal => 0.8129

(Julia Statistics) pkg> st
Status `C:\projects\Julia Statistics\Project.toml`
[31c24e10] Distributions v0.25.10
[91a5bcdd] Plots v1.19.0
[f3b207a7] StatsPlots v0.14.25

(Julia Statistics) pkg>

``````

I assume you’re saying it’s working now, in which case glad to see that it is!

Thanks everybody for the great answers!!

1 Like