Help with ChisqTest

Hi,

I don’t think you’re doing what you really want. The Titanic dataset isn’t a “tidy” dataset with one row per passenger: the data is already aggregated for you, and the counts you want are in the Freq column. All you need to do is sum Freq per category to get the table, so you don’t really need the one-hot encoding.

julia> using RDatasets, HypothesisTests;

julia> titanic = dataset("datasets", "Titanic");
julia> first(titanic, 8)
8×5 DataFrame
│ Row │ Class  │ Sex    │ Age    │ Survived │ Freq  │
│     │ String │ String │ String │ String   │ Int64 │
├─────┼────────┼────────┼────────┼──────────┼───────┤
│ 1   │ 1st    │ Male   │ Child  │ No       │ 0     │
│ 2   │ 2nd    │ Male   │ Child  │ No       │ 0     │
│ 3   │ 3rd    │ Male   │ Child  │ No       │ 35    │
│ 4   │ Crew   │ Male   │ Child  │ No       │ 0     │
│ 5   │ 1st    │ Female │ Child  │ No       │ 0     │
│ 6   │ 2nd    │ Female │ Child  │ No       │ 0     │
│ 7   │ 3rd    │ Female │ Child  │ No       │ 17    │
│ 8   │ Crew   │ Female │ Child  │ No       │ 0     │

First, aggregate the data by the variables you need, here Sex and Survived.

julia> y = by(titanic, [:Sex, :Survived], [:Freq] =>
              x -> (sum(x.Freq)))
julia> y
4×3 DataFrame
│ Row │ Sex    │ Survived │ x1    │
│     │ String │ String   │ Int64 │
├─────┼────────┼──────────┼───────┤
│ 1   │ Male   │ No       │ 1364  │
│ 2   │ Male   │ Yes      │ 367   │
│ 3   │ Female │ No       │ 126   │
│ 4   │ Female │ Yes      │ 344   │
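As far as I know, by was deprecated and later removed in newer DataFrames releases, so on a current install the call above may not work. Here is a minimal sketch of the same aggregation with groupby and combine, assuming DataFrames 1.x is installed next to RDatasets. The helper count_for is a hypothetical name of mine, not part of any package; it looks cells up by label so the result doesn't depend on the row order of the grouped table.

```julia
using RDatasets, DataFrames

titanic = dataset("datasets", "Titanic")

# groupby + combine replaces by; the summed column is named Freq instead of x1
y = combine(groupby(titanic, [:Sex, :Survived]), :Freq => sum => :Freq)

# count_for (hypothetical helper): fetch one cell by its labels, not by row position
count_for(sex, survived) = only(y.Freq[(y.Sex .== sex) .& (y.Survived .== survived)])

# Rows are Survived (No, Yes), columns are Sex (Male, Female)
x = [count_for(s, v) for v in ["No", "Yes"], s in ["Male", "Female"]]
```

The matrix x built this way can be passed to ChisqTest in the same way as the reshaped one.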

Then you need to take the x1 column and reshape it into a matrix.

julia> x = reshape(y.x1, (2,2))
julia> x
2×2 Array{Int64,2}:
 1364  126
  367  344
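The layout of that matrix comes from reshape filling it column by column. A small sketch to check which cell ends up where, with the counts typed in by hand in the same order as the aggregated table (the labels vector is only there for illustration):

```julia
# Same row order as the aggregated table:
# (Male, No), (Male, Yes), (Female, No), (Female, Yes)
counts = [1364, 367, 126, 344]
labels = ["Male/No", "Male/Yes", "Female/No", "Female/Yes"]

# reshape fills the first column top to bottom, then moves on to the second column
x = reshape(counts, (2, 2))
layout = reshape(labels, (2, 2))
```

Printing layout shows each Sex in its own column and each Survived value in its own row. If the grouped rows came out in a different order, the reshape would silently give a different layout, so it is worth checking.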

Because reshape fills the matrix column by column, this matrix has Sex in the columns (Male, Female) and Survived in the rows (No, Yes). Then you run the chi-square test on this contingency table.

julia> ChisqTest(x)
Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.5324063800663901, 0.2540543196155727, 0.14455863583547274, 0.06898066448256451]
    point estimate:          [0.6197183098591549, 0.1667423898228078, 0.05724670604270786, 0.1562925942753294]
    95% confidence interval: Tuple{Float64,Float64}[(0.5936, 0.6452), (0.1478, 0.1875), (0.0461, 0.0709), (0.1379, 0.1766)]

Test summary:
    outcome with 95% confidence: reject h_0
    one-sided p-value:           <1e-99

Details:
    Sample size:        2201
    statistic:          456.87415626043986
    degrees of freedom: 1
    residuals:          [5.61386523562475, -8.12681393766904, -10.773618376324867, 15.596240434172172]
    std. residuals:     [21.37461476285455, -21.37461476285455, -21.374614762854552, 21.37461476285456]
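If you want to double-check the statistic, here is a rough sketch that recomputes it by hand in plain Julia, using only the 2×2 counts from above. It assumes the usual Pearson formula with no continuity correction, which is what the numbers in the output are consistent with.

```julia
# Observed counts: rows are Survived (No, Yes), columns are Sex (Male, Female)
observed = [1364 126; 367 344]

n = sum(observed)
row_totals = sum(observed, dims=1 + 1)  # 2×1 matrix: totals for No and Yes
col_totals = sum(observed, dims=1)      # 1×2 matrix: totals for Male and Female

# Expected counts under independence: (row total × column total) / n
expected = row_totals * col_totals / n

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells
statistic = sum((observed .- expected) .^ 2 ./ expected)

# Degrees of freedom for an r×c table: (r - 1) * (c - 1)
dof = (size(observed, 1) - 1) * (size(observed, 2) - 1)
```

The statistic should agree with the value reported by ChisqTest (about 456.87), with 1 degree of freedom, and expected ./ n should match the "value under h_0" line.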

I’m not sure if this is the most direct way of doing that operation with that dataset, but this works. Hope this helps.
