# Example Chi Square test - why different answers from HypothesisTests and Distances?

I’m trying to reproduce an example chi-square test given in the textbook *Practical Statistics for Field Biology*, which compares a histogram of observed counts (O) with expected counts (E). The worked example gives a statistic of 5.2 with 9 degrees of freedom.

`HypothesisTests` also gives 5.2 but reports 19 degrees of freedom, while `Distances` gives 2.74.

I’m wondering whether a statistician would be kind enough to comment on the code below and let me know what I’m doing wrong, please. Ultimately, I’m trying to learn how to compare the histograms of two grey-scale images and give a p-value indicating their similarity or otherwise.

```julia
julia> using Distances

julia> O = [10 7 10 6 14 8 11 11 12 11]
1×10 Array{Int64,2}:
10  7  10  6  14  8  11  11  12  11

julia> E = [10 10 10 10 10 10 10 10 10 10]
1×10 Array{Int64,2}:
10  10  10  10  10  10  10  10  10  10

julia> chisq_dist(E,O)
2.7429759782700955

julia> using HypothesisTests

julia> ChisqTest(hcat(E,O))
Pearson's Chi-square Test
-------------------------
Population details:
parameter of interest:   Multinomial Probabilities
value under h_0:         [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
point estimate:          [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.035, 0.05, 0.03, 0.07, 0.04, 0.055, 0.055, 0.06, 0.055]
95% confidence interval: Tuple{Float64,Float64}[(0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.0, 0.0826), (0.005, 0.0976), (0.0, 0.0776), (0.025, 0.1176), (0.0, 0.0876), (0.01, 0.1026), (0.01, 0.1026), (0.015, 0.1076), (0.01, 0.1026)]

Test summary:
outcome with 95% confidence: fail to reject h_0
one-sided p-value:           0.9992

Details:
Sample size:        200
statistic:          5.200000000000001
degrees of freedom: 19
residuals:          [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.948683, 0.0, -1.26491, 1.26491, -0.632456, 0.316228, 0.316228, 0.632456, 0.316228]
std. residuals:     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.973329, 0.0, -1.29777, 1.29777, -0.648886, 0.324443, 0.324443, 0.648886, 0.324443]

```

I think you need to clarify which chi-square test you want to perform: a goodness of fit or a contingency table?

If you run your own code to estimate the goodness of fit,

```julia
sum((O .- E) .^ 2 ./ E)
```

I can reproduce this answer by running

```julia
using HypothesisTests
ChisqTest(O)
```

with the correct degrees of freedom, 9.

By testing both O and E together, you’re really running a longer goodness-of-fit test in which the deviations occur only in O, not in E, so you obtain the same chi-square value but with different degrees of freedom.

I assume the code in `Distances` tests a contingency table too.
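To see where the two numbers come from, both statistics can be computed by hand. A minimal sketch, using the same O and E as above: `chisq_dist` in `Distances` computes `sum((x - y).^2 ./ (x + y))`, which for these data (both rows total 100) coincides with the 2×10 contingency-table statistic.

```julia
O = [10, 7, 10, 6, 14, 8, 11, 11, 12, 11]
E = fill(10, 10)

# Goodness of fit: E is treated as fixed expected counts (9 df).
gof = sum((O .- E) .^ 2 ./ E)            # 5.2, the textbook value

# Distances' chisq_dist formula: sum((x - y)^2 / (x + y)). Because the
# row totals are equal here, this matches the 2x10 contingency-table
# statistic reported by ChisqTest on the stacked table.
cont = sum((O .- E) .^ 2 ./ (O .+ E))    # ≈ 2.743
```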


Thank you. Indeed, `HypothesisTests` and `Distances` do agree when given a contingency table:

```julia
julia> using HypothesisTests

julia> O = [10 7 10 6 14 8 11 11 12 11]
1×10 Array{Int64,2}:
10  7  10  6  14  8  11  11  12  11

julia> E = [10 10 10 10 10 10 10 10 10 10]
1×10 Array{Int64,2}:
10  10  10  10  10  10  10  10  10  10

julia> ChisqTest(vcat(E,O)')
Pearson's Chi-square Test
-------------------------
Population details:
parameter of interest:   Multinomial Probabilities
value under h_0:         [0.05, 0.0425, 0.05, 0.04, 0.06, 0.045, 0.0525, 0.0525, 0.055, 0.0525, 0.05, 0.0425, 0.05, 0.04, 0.06, 0.045, 0.0525, 0.0525, 0.055, 0.0525]
point estimate:          [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.035, 0.05, 0.03, 0.07, 0.04, 0.055, 0.055, 0.06, 0.055]
95% confidence interval: Tuple{Float64,Float64}[(0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.0, 0.0826), (0.005, 0.0976), (0.0, 0.0776), (0.025, 0.1176), (0.0, 0.0876), (0.01, 0.1026), (0.01, 0.1026), (0.015, 0.1076), (0.01, 0.1026)]

Test summary:
outcome with 95% confidence: fail to reject h_0
one-sided p-value:           0.9736

Details:
Sample size:        200
statistic:          2.7429759782700973
degrees of freedom: 9
residuals:          [0.0, 0.514496, 0.0, 0.707107, -0.57735, 0.333333, -0.154303, -0.154303, -0.301511, -0.154303, 0.0, -0.514496, 0.0, -0.707107, 0.57735, -0.333333, 0.154303, 0.154303, 0.301511, 0.154303]
std. residuals:     [0.0, 0.760652, 0.0, 1.04257, -0.870388, 0.494166, -0.230663, -0.230663, -0.451985, -0.230663, 0.0, -0.760652, 0.0, -1.04257, 0.870388, -0.494166, 0.230663, 0.230663, 0.451985, 0.230663]

```

For completeness, the goodness of fit test can be run like this:

```julia
julia> using HypothesisTests

julia> using LinearAlgebra

julia> O = [10, 7, 10, 6, 14, 8, 11, 11, 12, 11];

julia> E = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10];

julia> ChisqTest(O,normalize(E,1))
Pearson's Chi-square Test
-------------------------
Population details:
parameter of interest:   Multinomial Probabilities
value under h_0:         [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
point estimate:          [0.1, 0.07, 0.1, 0.06, 0.14, 0.08, 0.11, 0.11, 0.12, 0.11]
95% confidence interval: Tuple{Float64,Float64}[(0.02, 0.1837), (0.0, 0.1537), (0.02, 0.1837), (0.0, 0.1437), (0.06, 0.2237), (0.0, 0.1637), (0.03, 0.1937), (0.03, 0.1937), (0.04, 0.2037), (0.03, 0.1937)]

Test summary:
outcome with 95% confidence: fail to reject h_0
one-sided p-value:           0.8165

Details:
Sample size:        100
statistic:          5.200000000000001
degrees of freedom: 9
residuals:          [0.0, -0.9486832980505138, 0.0, -1.2649110640673518, 1.2649110640673518, -0.6324555320336759, 0.31622776601683794, 0.31622776601683794, 0.6324555320336759, 0.31622776601683794]
std. residuals:     [0.0, -1.0, 0.0, -1.3333333333333333, 1.3333333333333333, -0.6666666666666666, 0.3333333333333333, 0.3333333333333333, 0.6666666666666666, 0.3333333333333333]
```

What if I want to use a column from a DataFrame? How would I define the variables? I am already able to find the mean and median with `describe()`.

See FreqTables.jl to compute frequency tables.
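For instance (a sketch with a made-up column; `col` stands in for something like `df.category`), a DataFrame column is just a vector, so you can tabulate it with `freqtable` from FreqTables.jl, or with plain Base code, and pass the counts to `ChisqTest`:

```julia
# `col` stands in for a DataFrame column such as df.category (hypothetical).
col = ["a", "b", "a", "c", "a", "b"]

# Tally occurrences with Base only; FreqTables.freqtable(col) does the
# same more conveniently.
counts = Dict{String,Int}()
for x in col
    counts[x] = get(counts, x, 0) + 1
end

O = collect(values(counts))   # observed counts, e.g. for ChisqTest(O)
```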


You can also pass your own functions to `describe`.
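A sketch, assuming a recent DataFrames.jl, where `describe` accepts statistic names and `function => label` pairs (the column `x` is made up):

```julia
using DataFrames, Statistics

df = DataFrame(x = [1, 2, 3, 4])

# Built-in statistics by name, plus a custom function => label pair.
describe(df, :mean, :median, sum => :total)
```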