I created a DataFrame from a CSV by simply doing df = DataFrame(CSV.File("input.csv"))
Using describe(df) I noticed that I have both Int variables and String variables with eltype String7, String3 and so on.
To calculate the correlation between int variables I simply do
MethodError: no method matching /(::CategoricalArrays.CategoricalValue{InlineStrings.String7, UInt32}, ::Int64)
So, is there a way to calculate the correlation between categorical variables? It would be good if I could do something as simple as what I did for the numerical variables.
Clearly then cor doesn’t make sense for categorical variables, as there’s no notion of an average of categoricals, or the distance of an individual observation from another (or the mean).
The article above recommends the Chi-square test to measure dependency between two categorical variables, which is available in the HypothesisTests package here:
but there are other options (see e.g. an overview here or a recent paper where a new test is proposed here) so you might first want to decide which metric you’re interested in, then see if this is available somewhere in Julia.
Thanks for the answer! I knew about the chi2 test and its methodology. I was just wondering if there is an easy way to calculate it directly from the dataframe, or a simple way to create the contingency table from the dataframe. The chi2 test is what I wanted to use.
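For reference, here is a minimal sketch of one way to go from a DataFrame to a contingency table and then to a chi-square test. It assumes the FreqTables package is installed alongside DataFrames and HypothesisTests; the toy data and the column names colour and size are made up for illustration and are not from the original CSV.

```julia
using DataFrames, FreqTables, HypothesisTests

# Toy data standing in for two categorical columns
# (hypothetical names and values, not the original input.csv)
df = DataFrame(
    colour = ["red", "red", "blue", "blue", "red", "blue", "red", "blue"],
    size   = ["S", "S", "L", "L", "S", "L", "L", "S"],
)

# Cross-tabulate the two columns into a contingency table of counts
tbl = freqtable(df, :colour, :size)

# Convert to a plain integer matrix and run Pearson's chi-square test of independence
result = ChisqTest(Matrix(tbl))

# p-value of the test; a small value suggests the two variables are dependent
p = pvalue(result)
```

With a table this small the chi-square approximation is not reliable, so the sketch only illustrates the mechanics; on real data the same two calls apply to any pair of categorical columns.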
Thanks for the library and also thanks for the proposed paper, very interesting!