Help with ChisqTest

darrencl · March 1, 2020, 9:39am

Hi,

I want to test whether my variable is independent of the target y with ChisqTest from HypothesisTests.jl, so I think I need to use the contingency table version instead of goodness of fit (like sklearn’s).

First, I one-hot encoded my categorical variable, then fed it to the ChisqTest function. I saw there is a k parameter which affects the degrees of freedom (it seems degrees of freedom = (k - 1)^2). I am not a statistician, so what value should I use?

Anyway, with my one-hot-encoded features, this produces NaN p-values for every feature. Why is that? I am using the Titanic dataset from RDatasets.jl. Here’s a sample that produces NaN when testing one of my features against the target ‘y’ (Survived).

julia> titanic = dataset("datasets", "Titanic");

julia> X = one_hot_encode(titanic[:, [:Class, :Sex, :Age]]; drop_original=true)
32×8 DataFrame
│ Row │ Class_1st │ Class_2nd │ Class_3rd │ Class_Crew │ Sex_Female │ Sex_Male │ Age_Adult │ Age_Child │
│     │ Bool      │ Bool      │ Bool      │ Bool       │ Bool       │ Bool     │ Bool      │ Bool      │
├─────┼───────────┼───────────┼───────────┼────────────┼────────────┼──────────┼───────────┼───────────┤
│ 1   │ 1         │ 0         │ 0         │ 0          │ 0          │ 1        │ 0         │ 1         │
│ 2   │ 0         │ 1         │ 0         │ 0          │ 0          │ 1        │ 0         │ 1         │
│ 3   │ 0         │ 0         │ 1         │ 0          │ 0          │ 1        │ 0         │ 1         │
│ 4   │ 0         │ 0         │ 0         │ 1          │ 0          │ 1        │ 0         │ 1         │
│ 5   │ 1         │ 0         │ 0         │ 0          │ 1          │ 0        │ 0         │ 1         │
│ 6   │ 0         │ 1         │ 0         │ 0          │ 1          │ 0        │ 0         │ 1         │
│ 7   │ 0         │ 0         │ 1         │ 0          │ 1          │ 0        │ 0         │ 1         │
│ 8   │ 0         │ 0         │ 0         │ 1          │ 1          │ 0        │ 0         │ 1         │
│ 9   │ 1         │ 0         │ 0         │ 0          │ 0          │ 1        │ 1         │ 0         │
│ 10  │ 0         │ 1         │ 0         │ 0          │ 0          │ 1        │ 1         │ 0         │
│ 11  │ 0         │ 0         │ 1         │ 0          │ 0          │ 1        │ 1         │ 0         │
│ 12  │ 0         │ 0         │ 0         │ 1          │ 0          │ 1        │ 1         │ 0         │
│ 13  │ 1         │ 0         │ 0         │ 0          │ 1          │ 0        │ 1         │ 0         │
│ 14  │ 0         │ 1         │ 0         │ 0          │ 1          │ 0        │ 1         │ 0         │
│ 15  │ 0         │ 0         │ 1         │ 0          │ 1          │ 0        │ 1         │ 0         │
│ 16  │ 0         │ 0         │ 0         │ 1          │ 1          │ 0        │ 1         │ 0         │
│ 17  │ 1         │ 0         │ 0         │ 0          │ 0          │ 1        │ 0         │ 1         │
│ 18  │ 0         │ 1         │ 0         │ 0          │ 0          │ 1        │ 0         │ 1         │
│ 19  │ 0         │ 0         │ 1         │ 0          │ 0          │ 1        │ 0         │ 1         │
│ 20  │ 0         │ 0         │ 0         │ 1          │ 0          │ 1        │ 0         │ 1         │
│ 21  │ 1         │ 0         │ 0         │ 0          │ 1          │ 0        │ 0         │ 1         │
│ 22  │ 0         │ 1         │ 0         │ 0          │ 1          │ 0        │ 0         │ 1         │
│ 23  │ 0         │ 0         │ 1         │ 0          │ 1          │ 0        │ 0         │ 1         │
│ 24  │ 0         │ 0         │ 0         │ 1          │ 1          │ 0        │ 0         │ 1         │
│ 25  │ 1         │ 0         │ 0         │ 0          │ 0          │ 1        │ 1         │ 0         │
│ 26  │ 0         │ 1         │ 0         │ 0          │ 0          │ 1        │ 1         │ 0         │
│ 27  │ 0         │ 0         │ 1         │ 0          │ 0          │ 1        │ 1         │ 0         │
│ 28  │ 0         │ 0         │ 0         │ 1          │ 0          │ 1        │ 1         │ 0         │
│ 29  │ 1         │ 0         │ 0         │ 0          │ 1          │ 0        │ 1         │ 0         │
│ 30  │ 0         │ 1         │ 0         │ 0          │ 1          │ 0        │ 1         │ 0         │
│ 31  │ 0         │ 0         │ 1         │ 0          │ 1          │ 0        │ 1         │ 0         │
│ 32  │ 0         │ 0         │ 0         │ 1          │ 1          │ 0        │ 1         │ 0         │

julia> y = Vector{Int64}(recode(titanic.Survived,
                                    "No"=> 1,
                                    "Yes"=> 2)
                                    );

julia> X_data=convert(Matrix, X);

julia> ChisqTest(Int.(X_data[:,1]), y,2)
Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.5, 0.0, 0.5, 0.0]
    point estimate:          [0.5, 0.0, 0.5, 0.0]
    95% confidence interval: Tuple{Float64,Float64}[(0.25, 0.8761), (0.0, 0.3761), (0.25, 0.8761), (0.0, 0.3761)]

Test summary:
    outcome with 95% confidence: reject h_0
    one-sided p-value:           NaN

Details:
    Sample size:        8
    statistic:          NaN
    degrees of freedom: 1
    residuals:          [0.0, NaN, 0.0, NaN]
    std. residuals:     [NaN, NaN, NaN, NaN]
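
A possible explanation for the NaN, sketched under an assumption the thread does not confirm: that the integer third argument makes ChisqTest count only the levels 1:k in both vectors. The one-hot column contains 0 and 1, so the 0 rows would fall outside 1:2 and be dropped, which matches the reported sample size of 8 and the point estimate [0.5, 0.0, 0.5, 0.0]. The resulting table has an all-zero row, and the Pearson statistic then contains 0/0 terms. The plain-Julia arithmetic below reproduces that without calling HypothesisTests:

```julia
# Table implied by the printed point estimate [0.5, 0.0, 0.5, 0.0] and sample
# size 8 (column-major order). The second row never occurs, so it is all zeros.
# This reconstruction is an assumption based on the output, not library code.
observed = [4 4;
            0 0]

n = sum(observed)
row_totals = sum(observed, dims=2)   # 2×1 matrix
col_totals = sum(observed, dims=1)   # 1×2 matrix

# Expected counts under independence: (row total × column total) / n.
expected = row_totals * col_totals / n

# Pearson statistic: the zero row contributes 0/0 terms, which are NaN.
statistic = sum((observed .- expected) .^ 2 ./ expected)
```

If that reading is right, no choice of k fixes the call as written; the counts need to be arranged as a proper contingency table first.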

alejandromerchan · March 2, 2020, 7:30pm

Hi,

I don’t think you’re doing what you really want. The Titanic dataset isn’t a “tidy” dataset: the data is already aggregated for you, and the numbers you want are in the Freq column. All you need to do is sum the counts per category to get the table, so you don’t really need the one-hot encoding.

julia> using RDatasets, HypothesisTests;

julia> titanic = dataset("datasets", "Titanic");
julia> first(titanic, 8)
8×5 DataFrame
│ Row │ Class  │ Sex    │ Age    │ Survived │ Freq  │
│     │ String │ String │ String │ String   │ Int64 │
├─────┼────────┼────────┼────────┼──────────┼───────┤
│ 1   │ 1st    │ Male   │ Child  │ No       │ 0     │
│ 2   │ 2nd    │ Male   │ Child  │ No       │ 0     │
│ 3   │ 3rd    │ Male   │ Child  │ No       │ 35    │
│ 4   │ Crew   │ Male   │ Child  │ No       │ 0     │
│ 5   │ 1st    │ Female │ Child  │ No       │ 0     │
│ 6   │ 2nd    │ Female │ Child  │ No       │ 0     │
│ 7   │ 3rd    │ Female │ Child  │ No       │ 17    │
│ 8   │ Crew   │ Female │ Child  │ No       │ 0     │

First, you aggregate the data as you need.

julia> y = by(titanic, [:Sex, :Survived], [:Freq] =>
              x -> (sum(x.Freq)))
julia> y
4×3 DataFrame
│ Row │ Sex    │ Survived │ x1    │
│     │ String │ String   │ Int64 │
├─────┼────────┼──────────┼───────┤
│ 1   │ Male   │ No       │ 1364  │
│ 2   │ Male   │ Yes      │ 367   │
│ 3   │ Female │ No       │ 126   │
│ 4   │ Female │ Yes      │ 344   │
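
The `by` function used above is no longer available in recent DataFrames.jl releases. A rough equivalent, assuming DataFrames 1.x, combines `groupby` and `combine`. The toy table and its counts below are hypothetical and only illustrate the call; they are not the Titanic figures.

```julia
using DataFrames

# Hypothetical toy counts for illustration only (not the Titanic figures).
toy = DataFrame(
    Sex      = ["Male", "Female", "Male", "Female", "Male"],
    Survived = ["No",   "No",     "Yes",  "Yes",    "No"],
    Freq     = [10,     3,        5,      7,        2],
)

# Group by both categories and sum the frequencies. sort=true fixes the group
# order (Female/No, Female/Yes, Male/No, Male/Yes) so a later reshape is predictable.
agg = combine(groupby(toy, [:Sex, :Survived]; sort=true), :Freq => sum => :x1)

# Column-major reshape: rows are Survived (No, Yes), columns are Sex (Female, Male).
table = reshape(agg.x1, (2, 2))
```

Fixing the group order matters because the reshape relies purely on the row order of the aggregated table.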

Then you need to take the x1 column and reshape it as a matrix.

julia> x = reshape(y.x1, (2,2))
julia> x
2×2 Array{Int64,2}:
 1364  126
  367  344

As you can see, this matrix has Sex in the columns and Survived in the rows. Then you run the chi-square test on the contingency table.

julia> ChisqTest(x)
Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.5324063800663901, 0.2540543196155727, 0.14455863583547274, 0.06898066448256451]
    point estimate:          [0.6197183098591549, 0.1667423898228078, 0.05724670604270786, 0.1562925942753294]
    95% confidence interval: Tuple{Float64,Float64}[(0.5936, 0.6452), (0.1478, 0.1875), (0.0461, 0.0709), (0.1379, 0.1766)]

Test summary:
    outcome with 95% confidence: reject h_0
    one-sided p-value:           <1e-99

Details:
    Sample size:        2201
    statistic:          456.87415626043986
    degrees of freedom: 1
    residuals:          [5.61386523562475, -8.12681393766904, -10.773618376324867, 15.596240434172172]
    std. residuals:     [21.37461476285455, -21.37461476285455, -21.374614762854552, 21.37461476285456]

I’m not sure if this is the most direct way of doing that operation with that dataset, but this works. Hope this helps.
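
As a sanity check on the output above, the statistic and degrees of freedom can be recomputed by hand from the 2×2 table. This is a minimal sketch in plain Julia of the standard Pearson formula, not the HypothesisTests implementation. For an r×c table the degrees of freedom are (r - 1)(c - 1).

```julia
# Observed counts from the reshaped table: rows are Survived (No, Yes),
# columns are Sex (Male, Female).
observed = [1364 126;
             367 344]

n = sum(observed)

# Expected counts under independence: (row total × column total) / n.
expected = sum(observed, dims=2) * sum(observed, dims=1) / n

# Pearson chi-square statistic, without continuity correction.
statistic = sum((observed .- expected) .^ 2 ./ expected)

# Degrees of freedom for an r×c table: (r - 1)(c - 1).
dof = (size(observed, 1) - 1) * (size(observed, 2) - 1)
```

The result agrees with the statistic of about 456.874 and the single degree of freedom printed by ChisqTest.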

Albert_Zevelev · March 2, 2020, 8:18pm

Where did you find the one_hot_encode() function?

mthelm85 · March 2, 2020, 8:39pm

@Albert_Zevelev The only package I can think of that exports a function specifically for one-hot encoding is Flux.

That being said, here’s a quick-and-dirty function that accomplishes what the function in the OP’s example accomplishes (more or less):

function one_hot_encode(df::DataFrame)
    encoded = DataFrame()
    for col in names(df), val in unique(df[!, col])
        encoded[!, Symbol(val)] = ifelse.(df[!, col] .== val, 1, 0)
    end
    return encoded
end

And here it is in action with this specific example:

using DataFrames
using RDatasets

titanic = dataset("datasets", "Titanic")

julia> one_hot_encode(titanic[:, [:Class, :Sex, :Age]])
32×8 DataFrame
│ Row │ 1st   │ 2nd   │ 3rd   │ Crew  │ Male  │ Female │ Child │ Adult │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64  │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┼────────┼───────┼───────┤
│ 1   │ 1     │ 0     │ 0     │ 0     │ 1     │ 0      │ 1     │ 0     │
│ 2   │ 0     │ 1     │ 0     │ 0     │ 1     │ 0      │ 1     │ 0     │
│ 3   │ 0     │ 0     │ 1     │ 0     │ 1     │ 0      │ 1     │ 0     │
│ 4   │ 0     │ 0     │ 0     │ 1     │ 1     │ 0      │ 1     │ 0     │
│ 5   │ 1     │ 0     │ 0     │ 0     │ 0     │ 1      │ 1     │ 0     │
│ 6   │ 0     │ 1     │ 0     │ 0     │ 0     │ 1      │ 1     │ 0     │
│ 7   │ 0     │ 0     │ 1     │ 0     │ 0     │ 1      │ 1     │ 0     │
│ 8   │ 0     │ 0     │ 0     │ 1     │ 0     │ 1      │ 1     │ 0     │
│ 9   │ 1     │ 0     │ 0     │ 0     │ 1     │ 0      │ 0     │ 1     │
⋮
│ 23  │ 0     │ 0     │ 1     │ 0     │ 0     │ 1      │ 1     │ 0     │
│ 24  │ 0     │ 0     │ 0     │ 1     │ 0     │ 1      │ 1     │ 0     │
│ 25  │ 1     │ 0     │ 0     │ 0     │ 1     │ 0      │ 0     │ 1     │
│ 26  │ 0     │ 1     │ 0     │ 0     │ 1     │ 0      │ 0     │ 1     │
│ 27  │ 0     │ 0     │ 1     │ 0     │ 1     │ 0      │ 0     │ 1     │
│ 28  │ 0     │ 0     │ 0     │ 1     │ 1     │ 0      │ 0     │ 1     │
│ 29  │ 1     │ 0     │ 0     │ 0     │ 0     │ 1      │ 0     │ 1     │
│ 30  │ 0     │ 1     │ 0     │ 0     │ 0     │ 1      │ 0     │ 1     │
│ 31  │ 0     │ 0     │ 1     │ 0     │ 0     │ 1      │ 0     │ 1     │
│ 32  │ 0     │ 0     │ 0     │ 1     │ 0     │ 1      │ 0     │ 1     │

Albert_Zevelev · March 2, 2020, 8:55pm

Thanks!
Your code is much neater than (MLUtils.jl/utils.jl at master · marubontan/MLUtils.jl · GitHub).

A few points:
1. In the linked code the column names become “class_1st” etc.; in your code they become “1st” etc.
Update:

function one_hot_encode(df::DataFrame)
    encoded = DataFrame()
    for col in names(df), val in unique(df[!, col])
        encoded[!, Symbol( string(col) * "_" * string(val)) ] = ifelse.(df[!, col] .== val, 1, 0)
    end
    return encoded
end

2. In economics we call these factor variables (dummy variables), and econometric software automatically omits one level of each to avoid multicollinearity (when there is an intercept).
Do you know if ML packages such as MLJ or Flux do this as well?
Is it hard to do this in the code above, i.e. create:
class_1st, class_2nd, class_3rd, omitting class_Crew
Sex_Male, omitting Sex_Female

Update: here is how I do it:

function one_hot_encode(df::DataFrame)
    encoded = DataFrame()
    for col in names(df), val in unique(df[!, col])[1:end-1]
        lab = string(col) * "_" * string(val)
        encoded[!, Symbol(lab) ] = ifelse.(df[!, col] .== val, 1, 0)
    end
    return encoded
end
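
To illustrate why one level is usually dropped when there is an intercept, here is a small check using only the LinearAlgebra standard library. The toy column is hypothetical and not taken from the Titanic data. With every level encoded, the one-hot columns add up to the intercept column, so the design matrix loses rank.

```julia
using LinearAlgebra

# Hypothetical toy column, not taken from the Titanic data.
class = ["1st", "2nd", "3rd", "Crew", "1st", "3rd"]
levels = unique(class)                      # ["1st", "2nd", "3rd", "Crew"]

# Full one-hot encoding: one 0/1 column per level (6×4 matrix).
full = [Int(c == l) for c in class, l in levels]

# Every row has exactly one 1, so the columns sum to the intercept column.
intercept = ones(Int, length(class))
@assert vec(sum(full, dims=2)) == intercept

# Design matrix with an intercept: 5 columns, but one of them is redundant.
rank_full = rank(hcat(intercept, full))

# Dropping the last level removes the redundancy: 4 columns, full column rank.
reduced = full[:, 1:end-1]
rank_reduced = rank(hcat(intercept, reduced))
```

Both matrices have rank 4, but the first has 5 columns, which is the multicollinearity that omitting a level avoids.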

darrencl · March 2, 2020, 11:20pm

Thanks a lot! Really appreciate it.

darrencl · March 2, 2020, 11:21pm

Oops! I wrote it on my own and forgot to attach it.

My code is not as neat as @mthelm85’s, though.
