Additional details
The chi-square test can be used only when the distribution of testValue is well approximated by a chi-square distribution.
In the standard goodness-of-fit test, the degrees of freedom of the chi-square distribution should equal (the number of elements in the support of the discrete distribution under test) - 1.
However, the effective (or substantial) support of Binomial(100, 0.7) is at most the range from 40 to 100, i.e. about 61 points. So, in this case, we can expect the appropriate degrees of freedom to be roughly 60.
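This ballpark can be checked numerically by counting the outcomes of Binomial(100, 0.7) whose probability is non-negligible. The sketch below computes the pmf in BigFloat (the same numbers as `pdf.(Binomial(100, 0.7), 0:100)` from Distributions.jl) so tiny tail values stay accurate; the cutoff 1e-10 is an arbitrary choice, not part of the test itself.

```julia
# Pmf of Binomial(n, p) at k, in BigFloat so tiny tail probabilities stay accurate.
binompdf(n, p, k) = binomial(big(n), big(k)) * big(p)^k * big(1 - p)^(n - k)

# Count outcomes with non-negligible probability (the 1e-10 cutoff is arbitrary).
n_eff = count(k -> binompdf(100, 0.7, k) > 1e-10, 0:100)
println(n_eff)  # far fewer than the 101 points of 0:100
```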
Let’s check this in the following.
The mathematically idealized supports of DiscreteUniform(0, 100) and Binomial(100, 0.7) are both 0:100, but the effective support of the latter is much smaller than that of the former.
using Distributions, StatsPlots
P = plot(DiscreteUniform(0, 100); label="DiscreteUniform(0, 100)", xtick=0:10:100, ylim=(0, 0.013))
Q = plot(Binomial(100, 0.7); label="Binomial(100, 0.7)", legend=:topleft, xtick=0:10:100)
plot(P, Q; size=(800, 250))
using Distributions, Random, StatsPlots
# The test definitions

function computeDensity(data, supp)
    counts = [count(i -> i == s, data) for s in supp]
    if length(data) > sum(counts)
        error("There are some data not in the support!")
    end
    return counts
end
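As a quick check, `computeDensity` just tabulates counts over the support (the definition is repeated so the snippet runs on its own; the sample values are made up):

```julia
function computeDensity(data, supp)
    counts = [count(i -> i == s, data) for s in supp]
    if length(data) > sum(counts)
        error("There are some data not in the support!")
    end
    return counts
end

data = [0, 1, 1, 2, 2, 2]           # hypothetical sample
println(computeDensity(data, 0:2))  # [1, 2, 3]
```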
"""
Modified version of `goodnessOfFitDiscrete` function in
https://discourse.julialang.org/t/whats-wrong-with-my-chi-squared-goodness-of-fits-tests/64334
* The old argument `support` is deleted.
* The new local variable `supp` is defined to be `support(f₀)`
"""
function goodnessOfFitDiscrete(data, f₀; compressedData=true, α=0.05, d=0)
    supp = support(f₀)
    if !compressedData
        data = computeDensity(data, supp)
    end
    K = length(supp)
    N = sum(data)
    p̂ = data ./ N
    df = K - d - 1
    p0 = pdf.(f₀, supp)
    T = N * sum((p̂[k] - p0[k])^2 / p0[k] for k in 1:K)
    χDist = Chisq(df)
    threshold = quantile(χDist, 1 - α)
    rejectedH₀ = T > threshold
    p_value = 1 - cdf(χDist, T)
    return (testValue=T, threshold=threshold, rejectedH₀=rejectedH₀, p_value=p_value)
end
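The statistic `T` above is the classical Pearson chi-square statistic T = N·Σₖ (p̂ₖ − p₀ₖ)²/p₀ₖ. A hand computation on a made-up sample of 100 fair-die rolls (hypothetical counts, independent of the function above) shows the pieces:

```julia
counts = [16, 18, 16, 14, 17, 19]  # hypothetical counts of 100 fair-die rolls
N = sum(counts)
p̂ = counts ./ N                    # empirical frequencies
p0 = fill(1 / 6, 6)                # null probabilities under a fair die
T = N * sum((p̂ .- p0) .^ 2 ./ p0)
println(T)  # ≈ 0.92
```

Since T ≈ 0.92 is far below the threshold quantile(Chisq(5), 0.95) ≈ 11.07, the fair-die hypothesis would not be rejected for these counts.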
function repeat_tests(f₀; datasize=10000, repetitions=10000, α=0.05, d=0)
    testValue = zeros(repetitions)
    rejectedH₀ = falses(repetitions)
    data = rand(f₀, datasize)
    for rep in 1:repetitions
        rand!(f₀, data)
        out = goodnessOfFitDiscrete(data, f₀; compressedData=false, α, d)
        testValue[rep] = out.testValue
        rejectedH₀[rep] = out.rejectedH₀
    end
    α_real = sum(rejectedH₀) / repetitions
    (; α_real, testValue, rejectedH₀)
end
function plot_testValue_dist(f₀; datasize=10000, repetitions=10000, α=0.05, d=0)
    α_real, testValue, rejectedH₀ = repeat_tests(f₀; datasize, repetitions, α, d)
    title = "$f₀ (real α = $α_real / nominal α = $α)"
    plot(; title, titlefontsize=10)
    xlim = (0, 2length(support(f₀)))
    a, b = xlim
    histogram!(testValue[a .≤ testValue .≤ b]; norm=true, alpha=0.3, label="testValue", xlim)
    df = length(support(f₀)) - d - 1
    plot!(Chisq(df), a, b; label="Chisq(df = $df)", ls=:dash, lw=2)
end
In the DiscreteUniform(0, 100) case, the distribution of testValue matches the chi-square distribution (df = 100) almost exactly.
plot_testValue_dist(DiscreteUniform(0, 100))

But in the Binomial(100, 0.7) case it does not: the empirical (real) α is much smaller than the nominal α.
plot_testValue_dist(Binomial(100, 0.7))
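One way to see why the approximation breaks down: with datasize = 10000, a large fraction of the 101 cells of Binomial(100, 0.7) have expected count N·p₀ₖ far below 1, and such near-empty cells invalidate the chi-square limit. A sketch counting them (BigFloat pmf for tail accuracy; the cutoff "expected count < 1" is just a rule of thumb):

```julia
# Pmf of Binomial(n, p) at k, in BigFloat so tiny tail probabilities stay accurate.
binompdf(n, p, k) = binomial(big(n), big(k)) * big(p)^k * big(1 - p)^(n - k)

N = 10_000  # the datasize used above
# Cells whose expected count N * p0 falls below 1 (rule-of-thumb cutoff).
n_tiny = count(k -> N * binompdf(100, 0.7, k) < 1, 0:100)
println(n_tiny)  # roughly two thirds of the 101 cells
```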

When the degrees of freedom are reduced to 60 by setting d = 40 (df = 101 - 40 - 1 = 60), the empirical α comes close to the nominal α = 0.05.
plot_testValue_dist(Binomial(100, 0.7); d = 40)

