Correlation between categorical variables

I created a DataFrame from a CSV by simply doing df = DataFrame(CSV.File("input.csv"))
Using describe(df) I noticed that I have both Int variables and String variables with eltype String7, String3 and so on.
To calculate the correlation between int variables I simply do

num_var_names = names(df, Int64)[2:end]
num_vars = df[!, num_var_names]
cor_matrix = cor(Matrix(num_vars))

I also want to calculate the correlation between the categorical variables (identified as StringX).
My first attempt was to simply do something like

cor(Matrix(df[!, ["var1", "var2"]]))

but I get
MethodError: no method matching /(::InlineStrings.String7, ::Int64)

If I print out df[!, "var1"] I get
PooledArrays.PooledVector{InlineStrings.String7, UInt32, Vector{UInt32}}: [values...]

I also tried converting to categorical with

cor(CategoricalArray(df[!, "MSZoning"]), CategoricalArray(df[!, "MSZoning"]))

but I still get

MethodError: no method matching /(::CategoricalArrays.CategoricalValue{InlineStrings.String7, UInt32}, ::Int64)

So, is there a way to calculate the correlation between categorical variables? It would be good if I could do something as simple as what I did for the numerical variables.

Thanks.

If no-one comes forward with an existing solution, here’s how to do it in Python DataFrames

As Matt’s link alludes to, this isn’t really a Julia but a methodology question:

help?> cor

(...)

  cor(x::AbstractVector, y::AbstractVector)

  Compute the Pearson correlation between the vectors x and y.

so cor gives you the Pearson correlation coefficient, defined as:

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}

Clearly then cor doesn’t make sense for categorical variables, as there’s no notion of an average of categoricals, or the distance of an individual observation from another (or the mean).

The article above recommends the Chi-square test to measure dependency between two categorical variables, which is available in the HypothesisTests package here:

https://juliastats.org/HypothesisTests.jl/stable/parametric/#Pearson-chi-squared-test-1

but there are other options (see e.g. an overview here or a recent paper where a new test is proposed here) so you might first want to decide which metric you’re interested in, then see if this is available somewhere in Julia.
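
To make the chi-square suggestion concrete, here is a minimal sketch of calling ChisqTest from HypothesisTests on a contingency table. The counts matrix is made up for illustration (rows standing for the levels of one categorical variable, columns for the levels of the other); it is not taken from the data in the question.

```julia
using HypothesisTests

# Hypothetical 2x3 contingency table of counts:
# rows = levels of var1, columns = levels of var2
counts = [12 5 7;
           3 9 8]

# Pearson chi-squared test of independence on an integer count matrix
test = ChisqTest(counts)

# A small p-value suggests the two variables are not independent
p = pvalue(test)
println(p)
```

Printing test itself should also show the test summary, which is usually the quickest way to inspect the result interactively.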

Thanks for the answer! I knew about the chi2 test and the methodology. I was just wondering if there is an easy way to calculate it directly from the dataframe, or a simple way to create the contingency table from the dataframe. The chi2 test is what I wanted to use.
Thanks for the library and also thanks for the proposed paper, very interesting!
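
One possible way to get the contingency table straight from the DataFrame, sketched with only DataFrames and a comprehension. The toy df and the column names var1 and var2 are placeholders, and the sketch assumes neither column contains missing values. The resulting integer matrix is the kind of input that ChisqTest from HypothesisTests accepts.

```julia
using DataFrames

# Toy data standing in for two String columns of the real df (hypothetical values)
df = DataFrame(var1 = ["a", "a", "b", "b", "a", "b"],
               var2 = ["x", "y", "x", "y", "x", "y"])

# Distinct levels of each variable, in order of first appearance
row_levels = unique(df.var1)
col_levels = unique(df.var2)

# counts[i, j] = number of rows where var1 == row_levels[i] and var2 == col_levels[j]
counts = [count((df.var1 .== r) .& (df.var2 .== c))
          for r in row_levels, c in col_levels]

println(counts)
```

The comprehension scans the columns once per cell, which should be fine for a handful of levels; for many levels or many column pairs a dedicated package such as FreqTables.jl is probably the more convenient route.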