Correlation between categorical variables

iskyd · January 11, 2022, 5:06pm

I created a DataFrame from a CSV simply by doing df = DataFrame(CSV.File("input.csv")).
Using describe(df), I noticed that I have both Int variables and String variables with eltype String7, String3, and so on.
To calculate the correlation between the Int variables I simply do

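using Statistics  # cor lives in the Statistics standard library

num_var_names = names(df, Int64)[2:end]  # Int64 columns, skipping the first one
num_vars = df[!, num_var_names]
cor_matrix = cor(Matrix(num_vars))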

I also want to calculate the correlation between the categorical variables (identified as StringX).
My first attempt was simply to do something like

cor(Matrix(df[!, ["var1", "var2"]]))

but I get
MethodError: no method matching /(::InlineStrings.String7, ::Int64)

If I print out df[!, "var1"] I get
PooledArrays.PooledVector{InlineStrings.String7, UInt32, Vector{UInt32}}: [values...]

I also tried converting to categorical with

cor(CategoricalArray(df[!, "MSZoning"]), CategoricalArray(df[!, "MSZoning"]))

but I still get

MethodError: no method matching /(::CategoricalArrays.CategoricalValue{InlineStrings.String7, UInt32}, ::Int64)

So, is there a way to calculate the correlation between categorical variables? It would be nice if I could do something as simple as what I did for the numerical variables.

Thanks.

lawless-m · January 12, 2022, 8:58am

If no-one comes forward with an existing solution, here’s how to do it in Python DataFrames

nilshg · January 12, 2022, 9:44am

As Matt’s link alludes to, this isn’t really a Julia question but a methodology question:

help?> cor

(...)

  cor(x::AbstractVector, y::AbstractVector)

  Compute the Pearson correlation between the vectors x and y.

so cor gives you the Pearson correlation coefficient, defined as:

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}

Clearly then cor doesn’t make sense for categorical variables, as there’s no notion of an average of categoricals, or the distance of an individual observation from another (or the mean).
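
To make that concrete, here is a minimal sketch of the Pearson formula written out by hand and checked against cor. The function name pearson and the sample vectors are made up for illustration. Every step subtracts a mean or divides by a count, which is exactly what fails for String7 or CategoricalValue elements.

```julia
using Statistics  # standard library; provides mean and cor

# Pearson correlation written out term by term (hypothetical helper,
# not part of any package).
function pearson(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    length(x) == length(y) || throw(ArgumentError("x and y must have the same length"))
    dx = x .- mean(x)   # deviations from the mean of x
    dy = y .- mean(y)   # deviations from the mean of y
    return sum(dx .* dy) / sqrt(sum(abs2, dx) * sum(abs2, dy))
end

# Made-up numeric data, only to compare the two results.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r_manual = pearson(x, y)
r_builtin = cor(x, y)   # should agree with r_manual up to rounding
```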

The article above recommends the Chi-square test to measure dependency between two categorical variables, which is available in the HypothesisTests package here:

https://juliastats.org/HypothesisTests.jl/stable/parametric/#Pearson-chi-squared-test-1

but there are other options (see e.g. an overview here or a recent paper where a new test is proposed here), so you might first want to decide which metric you’re interested in, then see if it is available somewhere in Julia.

iskyd · January 14, 2022, 4:56pm

Thanks for the answer! I knew about the chi2 test and the methodology. I was just wondering if there is an easy way to calculate it directly from the dataframe, or a simple way to create the contingency table from the dataframe. The chi2 test is what I wanted to use.
Thanks for the library and also thanks for the proposed paper, very interesting!
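
For reference, a minimal sketch of building the contingency table and the chi-squared statistic by hand, using only base Julia. The helper names contingency and chisq_stats, and the sample labels, are made up for illustration. It assumes no missing values and at least two levels in each variable. It also returns Cramér's V, a common chi-squared-based association measure between 0 and 1, as an addition that was not discussed above.

```julia
# Count co-occurrences of the labels in two equal-length vectors.
# Hypothetical helper; returns the table plus its row and column labels.
function contingency(a::AbstractVector, b::AbstractVector)
    length(a) == length(b) || throw(ArgumentError("vectors must have the same length"))
    rows = sort(unique(a))
    cols = sort(unique(b))
    row_index = Dict(v => i for (i, v) in enumerate(rows))
    col_index = Dict(v => j for (j, v) in enumerate(cols))
    table = zeros(Int, length(rows), length(cols))
    for (u, v) in zip(a, b)
        table[row_index[u], col_index[v]] += 1
    end
    return table, rows, cols
end

# Pearson chi-squared statistic (no continuity correction),
# degrees of freedom, and Cramér's V for a table of counts.
function chisq_stats(table::AbstractMatrix{<:Integer})
    n = sum(table)
    # Expected counts under independence: (row total * column total) / n
    expected = sum(table, dims=2) * sum(table, dims=1) ./ n
    chi2 = sum((table .- expected) .^ 2 ./ expected)
    r, c = size(table)
    dof = (r - 1) * (c - 1)
    cramers_v = sqrt(chi2 / (n * min(r - 1, c - 1)))
    return (chi2 = chi2, dof = dof, cramers_v = cramers_v)
end

# Made-up labels standing in for two categorical columns.
zoning = ["RL", "RL", "RM", "RM", "RL", "RM", "RL", "RM"]
street = ["Pave", "Pave", "Grvl", "Grvl", "Pave", "Grvl", "Pave", "Pave"]

table, row_labels, col_labels = contingency(zoning, street)
stats = chisq_stats(table)
```

With a DataFrame, the same call would presumably look like contingency(df.var1, df.var2). For real work, freqtable from FreqTables.jl and ChisqTest from HypothesisTests.jl cover the same ground and also give a p-value; the hand-written version is only meant to show what is being computed.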
