Example Chi Square test - why different answers from HypothesisTests and Distances?

robblackwell · August 13, 2018, 4:10pm

I’m trying to reproduce an example Chi-square test given in the textbook Practical Statistics for Field Biology, which compares the histogram of some numbers (O) with expected values (E). The worked example gives 5.2 with 9 degrees of freedom.

HypothesisTests gives 5.2 but reports 19 degrees of freedom, while Distances gives 2.74.

I’m wondering whether a statistician would be kind enough to comment on the code below and let me know what I’m doing wrong, please. Ultimately, I’m trying to learn how to compare the histograms of two grey-scale images and give a p-value indicating their similarity or otherwise.

julia> using Distances

julia> O = [10 7 10 6 14 8 11 11 12 11]
1×10 Array{Int64,2}:
 10  7  10  6  14  8  11  11  12  11

julia> E = [10 10 10 10 10 10 10 10 10 10]
1×10 Array{Int64,2}:
 10  10  10  10  10  10  10  10  10  10

julia> chisq_dist(E,O)
2.7429759782700955

julia> using HypothesisTests

julia> ChisqTest(hcat(E,O)) 
Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
    point estimate:          [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.035, 0.05, 0.03, 0.07, 0.04, 0.055, 0.055, 0.06, 0.055]
    95% confidence interval: Tuple{Float64,Float64}[(0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.0, 0.0826), (0.005, 0.0976), (0.0, 0.0776), (0.025, 0.1176), (0.0, 0.0876), (0.01, 0.1026), (0.01, 0.1026), (0.015, 0.1076), (0.01, 0.1026)]

Test summary:
    outcome with 95% confidence: fail to reject h_0
    one-sided p-value:           0.9992

Details:
    Sample size:        200
    statistic:          5.200000000000001
    degrees of freedom: 19
    residuals:          [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.948683, 0.0, -1.26491, 1.26491, -0.632456, 0.316228, 0.316228, 0.632456, 0.316228]
    std. residuals:     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.973329, 0.0, -1.29777, 1.29777, -0.648886, 0.324443, 0.324443, 0.648886, 0.324443]

alejandromerchan · August 13, 2018, 5:20pm

I think you need to clarify which Chi-square test you want to perform: goodness of fit or a contingency table?

If you run your own code to estimate the goodness of fit

sum((O .- E) .^ 2 ./ E)

you obtain the 5.2 answer.
I can reproduce this answer by running

using HypothesisTests
ChisqTest(O)

with the correct degrees of freedom, 9.

By testing both O and E, you’re really running one long goodness-of-fit test over 20 categories (each with an expected count of 10), where the deviations occur only in the O half, not in the E half. That gives the same Chi-square value but 19 degrees of freedom instead of 9.
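
The effect of concatenating the two arrays can be checked by hand. The following is a minimal sketch in plain Julia (no packages), assuming equal expected counts in every category; `uniform_gof` is a hypothetical helper name introduced only for this illustration.

```julia
O = [10 7 10 6 14 8 11 11 12 11]
E = [10 10 10 10 10 10 10 10 10 10]

# Pearson statistic against equal expected counts in every category
# (hypothetical helper, not part of HypothesisTests)
function uniform_gof(counts)
    expected = sum(counts) / length(counts)   # same expected count per category
    stat = sum((counts .- expected) .^ 2 ./ expected)
    return stat, length(counts) - 1           # statistic, degrees of freedom
end

stat10, df10 = uniform_gof(O)            # 10 categories, expected count 10 each
stat20, df20 = uniform_gof(hcat(E, O))   # 20 categories, expected count 10 each

println((stat10, df10))
println((stat20, df20))
```

The E half contributes only zeros to the sum, so the statistic is unchanged while the number of categories, and hence the degrees of freedom, grows from 9 to 19.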

I assume the code in Distances tests a contingency table too.
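
For reference, the Distances.jl documentation defines `chisq_dist(x, y)` as the sum of (x_i - y_i)^2 / (x_i + y_i), which is a distance measure rather than a hypothesis test. A minimal check in plain Julia, with no packages loaded, suggests that this formula accounts for the 2.74 value:

```julia
O = [10, 7, 10, 6, 14, 8, 11, 11, 12, 11]
E = fill(10, 10)

# Chi-square distance as documented for Distances.chisq_dist:
# each squared difference is divided by the sum of the two values, not by E alone
d = sum((E .- O) .^ 2 ./ (E .+ O))

println(d)
```

Because the denominator is E + O rather than E, the result differs from the goodness-of-fit statistic of 5.2.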

robblackwell · August 14, 2018, 10:04am

Thank you - Indeed, HypothesisTests and Distances do agree if using a contingency table:

julia> using HypothesisTests

julia> O = [10 7 10 6 14 8 11 11 12 11]
1×10 Array{Int64,2}:
 10  7  10  6  14  8  11  11  12  11

julia> E = [10 10 10 10 10 10 10 10 10 10]
1×10 Array{Int64,2}:
 10  10  10  10  10  10  10  10  10  10

julia> ChisqTest(vcat(E,O)') 
Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.05, 0.0425, 0.05, 0.04, 0.06, 0.045, 0.0525, 0.0525, 0.055, 0.0525, 0.05, 0.0425, 0.05, 0.04, 0.06, 0.045, 0.0525, 0.0525, 0.055, 0.0525]
    point estimate:          [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.035, 0.05, 0.03, 0.07, 0.04, 0.055, 0.055, 0.06, 0.055]
    95% confidence interval: Tuple{Float64,Float64}[(0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.005, 0.0976), (0.0, 0.0826), (0.005, 0.0976), (0.0, 0.0776), (0.025, 0.1176), (0.0, 0.0876), (0.01, 0.1026), (0.01, 0.1026), (0.015, 0.1076), (0.01, 0.1026)]

Test summary:
    outcome with 95% confidence: fail to reject h_0
    one-sided p-value:           0.9736

Details:
    Sample size:        200
    statistic:          2.7429759782700973
    degrees of freedom: 9
    residuals:          [0.0, 0.514496, 0.0, 0.707107, -0.57735, 0.333333, -0.154303, -0.154303, -0.301511, -0.154303, 0.0, -0.514496, 0.0, -0.707107, 0.57735, -0.333333, 0.154303, 0.154303, 0.301511, 0.154303]
    std. residuals:     [0.0, 0.760652, 0.0, 1.04257, -0.870388, 0.494166, -0.230663, -0.230663, -0.451985, -0.230663, 0.0, -0.760652, 0.0, -1.04257, 0.870388, -0.494166, 0.230663, 0.230663, 0.451985, 0.230663]
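
The contingency-table statistic above can be reproduced by hand. This is a sketch in plain Julia, assuming the usual Pearson test of independence, with expected counts computed from the row and column totals:

```julia
O = [10, 7, 10, 6, 14, 8, 11, 11, 12, 11]
E = fill(10, 10)

table = hcat(E, O)              # 10×2 table, same layout as vcat(E, O)' above
n = sum(table)

# Expected count in each cell under independence: row total * column total / n
expected = sum(table, dims=2) * sum(table, dims=1) / n

stat = sum((table .- expected) .^ 2 ./ expected)
df = (size(table, 1) - 1) * (size(table, 2) - 1)

println("statistic = ", stat, ", degrees of freedom = ", df)
```

Since both columns total 100 here, each expected count is (E_j + O_j)/2 and the statistic reduces to the sum of (E_j - O_j)^2 / (E_j + O_j). With unequal totals, the two quantities would generally differ.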

robblackwell · November 28, 2019, 3:38pm

For completeness, the goodness of fit test can be run like this:

julia> using HypothesisTests

julia> using LinearAlgebra

julia> O = [10, 7, 10, 6, 14, 8, 11, 11, 12, 11];

julia> E = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10];

julia> ChisqTest(O,normalize(E,1))
Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
    point estimate:          [0.1, 0.07, 0.1, 0.06, 0.14, 0.08, 0.11, 0.11, 0.12, 0.11]
    95% confidence interval: Tuple{Float64,Float64}[(0.02, 0.1837), (0.0, 0.1537), (0.02, 0.1837), (0.0, 0.1437), (0.06, 0.2237), (0.0, 0.1637), (0.03, 0.1937), (0.03, 0.1937), (0.04, 0.2037), (0.03, 0.1937)]

Test summary:
    outcome with 95% confidence: fail to reject h_0
    one-sided p-value:           0.8165

Details:
    Sample size:        100
    statistic:          5.200000000000001
    degrees of freedom: 9
    residuals:          [0.0, -0.9486832980505138, 0.0, -1.2649110640673518, 1.2649110640673518, -0.6324555320336759, 0.31622776601683794, 0.31622776601683794, 0.6324555320336759, 0.31622776601683794]
    std. residuals:     [0.0, -1.0, 0.0, -1.3333333333333333, 1.3333333333333333, -0.6666666666666666, 0.3333333333333333, 0.3333333333333333, 0.6666666666666666, 0.3333333333333333]
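
The residuals in this output appear consistent with the usual Pearson definitions: a raw residual of (O - E)/sqrt(E), and a standardized residual that additionally divides by sqrt(1 - p), where p is the hypothesized probability. The sketch below works under that assumption; it is not taken from the HypothesisTests source.

```julia
O = [10, 7, 10, 6, 14, 8, 11, 11, 12, 11]
p = fill(0.1, 10)          # hypothesized probabilities, i.e. normalize(E, 1) above
n = sum(O)
expected = n .* p

# Pearson residuals: squaring and summing these gives the test statistic
residuals = (O .- expected) ./ sqrt.(expected)

# Standardized residuals also divide by sqrt(1 - p) (assumed definition)
std_residuals = (O .- expected) ./ sqrt.(expected .* (1 .- p))

println(residuals[2])
println(std_residuals[2])
```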

brett_knoss · February 26, 2020, 8:24am

What if I want to use a column from a DataFrame? How would I define the variables? I can already find the mean and median with describe().

nalimilan · February 26, 2020, 9:12am

See FreqTables.jl to compute frequency tables.
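
As a package-free sketch of what a frequency table provides, counts can be tallied from any vector of values. Here `column` is a hypothetical stand-in for a DataFrame column (for example something like `df.colour`); the data is made up purely for illustration.

```julia
# Hypothetical stand-in for a DataFrame column such as df.colour
column = ["red", "blue", "red", "green", "blue", "red"]

# Tally how many times each distinct value occurs
counts = Dict{String,Int}()
for value in column
    counts[value] = get(counts, value, 0) + 1
end

# Fix an order for the categories and pull out the observed counts as a vector
categories = sort(collect(keys(counts)))
observed = [counts[c] for c in categories]

println(categories)
println(observed)
```

The resulting `observed` vector of integer counts plays the same role as O in the examples above.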

pdeffebach · February 26, 2020, 1:43pm

You can also pass your own functions to describe.
