DataFrames and categorical: good style?

I’m trying to write some Kaggle Titanic code. First, some context:

using CSV, DataFrames

datadirectory = "../data/titanic/"
train_df = CSV.read(datadirectory * "train.csv")
describe(train_df)
variable	mean	min	median	max	nunique	nmissing	eltype
Symbol	Union…	Any	Union…	Any	Union…	Union…	Type
1	PassengerId	446.0	1	446.0	891			Int64
2	Survived	0.383838	0	0.0	1			Int64
3	Pclass	2.30864	1	3.0	3			Int64
4	Name		Abbing, Mr. Anthony		van Melkebeke, Mr. Philemon	891		String
5	Sex		female		male	2		String
6	Age	29.6991	0.42	28.0	80.0		177	Union{Missing, Float64}
7	SibSp	0.523008	0	0.0	8			Int64
8	Parch	0.381594	0	0.0	6			Int64
9	Ticket		110152		WE/P 5735	681		String
10	Fare	32.2042	0.0	14.4542	512.329			Float64
11	Cabin		A10		T	147	687	Union{Missing, String}
12	Embarked		C		S	3	2	Union{Missing, String}

Now I’m trying to apply some transformations to this data frame, but I’m not sure how I should write them, and I get slightly odd results. I doubt this code is good Julia.

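using DataFramesMeta, Pipe  # @select/@transform come from DataFramesMeta, @pipe from Pipe

train_df = @pipe train_df |>
    @select(_, :PassengerId, :Survived, :Pclass, :Sex, :Age) |>
    @transform(_, Survived = convert.(Bool, :Survived)) |>  # 0/1 becomes false/true
    categorical(_, :Pclass) |>
    categorical(_, :Sex)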
describe(train_df)
	variable	mean	min	median	max	nunique	nmissing	eltype
Symbol	Union…	Any	Union…	Any	Union…	Union…	Type
1	PassengerId	446.0	1	446.0	891			Int64
2	Survived	0.383838	0	0.0	1			Bool
3	Pclass		1		3	3		CategoricalValue{Int64,UInt32}
4	Sex		female		male	2		CategoricalValue{String,UInt32}
5	Age	29.6991	0.42	28.0	80.0		177	Union{Missing, Float64}

Why did I get CategoricalValue{Int64,UInt32} and CategoricalValue{String,UInt32}? The Sex column should contain just female or male, and the Pclass column just 1, 2, or 3.

What would be good Julia style to write that piece of code?

Sorry, what did you expect from this? You asked for categorical variables, and you got them.

I guess the confusion is about why categorical values are wrapped by default. The reason, as @pdeffebach commented, is that the wrapper signals that the element is a categorical value rather than a plain, non-categorical one. You can get the wrapped value using the get function.
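To make the wrapping concrete, here is a minimal sketch. It assumes a recent CategoricalArrays.jl, where as far as I know unwrap is the function for extracting the raw value (the get mentioned above is the older spelling). The vector sex is made-up data for illustration, not the Titanic column.

```julia
using CategoricalArrays

# Hypothetical stand-in for the Sex column.
sex = categorical(["female", "male", "female"])

# Indexing gives a CategoricalValue wrapper, not a plain String.
v = sex[1]

# The wrapper still compares equal to the value it wraps...
same = (v == "female")

# ...and unwrap returns the underlying String itself.
raw = unwrap(v)
```

So the column does contain just female and male. The eltype reported by describe only tells you that each element also knows which set of levels it belongs to.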

I don’t get why the categorical values have two types inside the curly braces: Int64 and UInt32, String and UInt32. Why not just one?

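Because a categorical array does not store the values themselves for every element. It stores the distinct levels once, in a pool, and each element holds only an integer reference into that pool. The first type parameter is the type of the levels (in your case presumably Int64 for Pclass and String for Sex), and the second is the type of those integer references, which is UInt32 by default.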
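A small sketch of that mapping, again on made-up data and assuming a recent CategoricalArrays.jl. levels and levelcode are the public accessors I would use here. The compress keyword is included only to show that the second type parameter really is the reference type and can be something other than UInt32.

```julia
using CategoricalArrays

# Hypothetical stand-in for the Sex column.
sex = categorical(["male", "female", "female", "male"])

# The pool of distinct levels, sorted by default: female, then male.
lv = levels(sex)

# Each element's position within that pool.
codes = levelcode.(sex)

# The element type carries both the level type and the reference type.
default_eltype = eltype(sex)

# compress=true asks for the smallest reference type that fits the levels.
small = categorical(["male", "female", "female", "male"]; compress=true)
small_eltype = eltype(small)
```

With only two levels, the compressed version can use UInt8 references, so its eltype has UInt8 in the second slot instead of UInt32.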

Oh! I see, so it’s fine.