I would like to learn all possible ways of summarizing categorical data using the dataframes ecosystem.
Preferably, I would like to compute the 1) frequency 2) proportions of each categorical variable after a grouping operation (preferably in a chain ed operation). Here is an example we can use. The result of countmap is not really presentable, but I am sure there are ways of converting that into a meaningful dataframe.
julia> dd = DataFrame(a = ["a", "b","a", "b"], b = ["no", "yes", "no", "no"], c = ["lo", "lo", "hi", "hi"])
4Γ3 DataFrame
Row β a b c
β String String String
ββββββΌββββββββββββββββββββββββ
1 β a no lo
2 β b yes lo
3 β a no hi
4 β b no hi
julia> cat_summary = @chain dd begin
groupby(_, [:a])
combine(_, vec([:b,:c] .=> countmap))
end
2Γ3 DataFrame
Row β a b_countmap c_countmap
β String Dictβ¦ Dictβ¦
ββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β a Dict("no"=>2) Dict("hi"=>1, "lo"=>1)
2 β b Dict("yes"=>1, "no"=>1) Dict("hi"=>1, "lo"=>1)
This is a touch question. I think a main problem is that if :b and :c have different numbers of categories, itβs hard to imagine a way to present this data as vectors of pairs rather than Dicts.
Here is something pretty good. Maybe someone can come up with something better, though.
julia> @chain dd begin
@aside v = unique(dd.b)
groupby(:a)
@combine b_countmap = begin
d = countmap(:b)
for vi in v
get!(d, vi, 0)
end
d
end
flatten(:b_countmap)
transform(:b_countmap => ByRow(b -> (b_value = first(b), b_count = last(b))) => AsTable)
select(Not(:b_countmap))
end
4Γ3 DataFrame
Row β a b_value b_count
β String String Int64
ββββββΌββββββββββββββββββββββββββ
1 β a yes 0
2 β a no 2
3 β b yes 1
4 β b no 1
And :b and :c at the same time can probably only really be handled by stacking them, because they donβt correspond to each other:
@chain dd begin
stack([:b, :c])
groupby([:a, :variable, :value])
combine(nrow => :count)
end
7Γ4 DataFrame
Row β a variable value count
β String String String Int64
ββββββΌβββββββββββββββββββββββββββββββββ
1 β a b no 2
2 β b b yes 1
3 β b b no 1
4 β a c lo 1
5 β b c lo 1
6 β a c hi 1
7 β b c hi 1