# Compute frequency or proportions on grouped dataframes

I would like to learn all possible ways of summarizing categorical data using the dataframes ecosystem.
Preferably, I would like to compute the 1) frequency 2) proportions of each categorical variable after a grouping operation (preferably in a chain ed operation). Here is an example we can use. The result of `countmap` is not really presentable, but I am sure there are ways of converting that into a meaningful dataframe.

``````julia> dd = DataFrame(a = ["a", "b","a", "b"], b = ["no", "yes", "no", "no"], c = ["lo", "lo", "hi", "hi"])
4Γ3 DataFrame
Row β a       b       c
β String  String  String
ββββββΌββββββββββββββββββββββββ
1 β a       no      lo
2 β b       yes     lo
3 β a       no      hi
4 β b       no      hi

julia> cat_summary = @chain dd begin
groupby(_, [:a])
combine(_, vec([:b,:c] .=> countmap))
end
2Γ3 DataFrame
Row β a       b_countmap               c_countmap
β String  Dictβ¦                    Dictβ¦
ββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β a       Dict("no"=>2)            Dict("hi"=>1, "lo"=>1)
2 β b       Dict("yes"=>1, "no"=>1)  Dict("hi"=>1, "lo"=>1)
``````

This is a touch question. I think a main problem is that if `:b` and `:c` have different numbers of categories, itβs hard to imagine a way to present this data as vectors of pairs rather than `Dict`s.

Do you have a particular output type in mind?

perhaps we can start with just `b` ?

it is not clear what you expect, but if you do not find better, try to adapt a scheme of the following type to your case

``````d = DataFrame(A = ["a", "b","a", "b"], B = ["no", "yes", "no", "no"], C = ["low", "low", "hi", "hi"])
gdd=groupby(dd,:A)
dx=Dict("no"=>0,"yes"=>0)
dy=Dict("hi"=>0,"low"=>0)
comb=combine(gdd,[:B,:C].=>countmap.=>[:Bb,:Cc])
tr=transform(comb,[:Bb,:Cc]=>ByRow((x,y)->[merge(dx,x),merge(dy,y)])=>[:Bb,:Cc])
transform(tr,[:Bb,:Cc].=>identity=>AsTable)
``````

Here is something pretty good. Maybe someone can come up with something better, though.

``````julia> @chain dd begin
@aside v = unique(dd.b)
groupby(:a)
@combine b_countmap = begin
d = countmap(:b)
for vi in v
get!(d, vi, 0)
end
d
end
flatten(:b_countmap)
transform(:b_countmap => ByRow(b -> (b_value = first(b), b_count = last(b))) => AsTable)
select(Not(:b_countmap))
end
4Γ3 DataFrame
Row β a       b_value  b_count
β String  String   Int64
ββββββΌββββββββββββββββββββββββββ
1 β a       yes            0
2 β a       no             2
3 β b       yes            1
4 β b       no             1
``````

Isnβt that getting close to the normal

``````@chain df begin
groupby([:a, :b])
combine(nrow => :count)
end
``````
2 Likes

Yeah it is lol. this is the correct answer.

Iβve had this mental twist before, Iβm thinking about groups of a, and then counts of instances of b, but really itβs counts of groups of [a, b]

Yes, and then:

``````@chain df begin
groupby([:a, :b])
combine(nrow => :count)
groupby(:a)
combine(:count => (x -> x / sum(x)) => :prop)
end
``````

to get proportions. At some point we will add add proprow and rownumber by bkamins Β· Pull Request #2556 Β· JuliaData/DataFrames.jl Β· GitHub.

2 Likes

And `:b` and `:c` at the same time can probably only really be handled by stacking them, because they donβt correspond to each other:

``````@chain dd begin
stack([:b, :c])
groupby([:a, :variable, :value])
combine(nrow => :count)
end
``````
``````7Γ4 DataFrame
Row β a       variable  value   count
β String  String    String  Int64
ββββββΌβββββββββββββββββββββββββββββββββ
1 β a       b         no          2
2 β b       b         yes         1
3 β b       b         no          1
4 β a       c         lo          1
5 β b       c         lo          1
6 β a       c         hi          1
7 β b       c         hi          1
``````
3 Likes

Indeed, it is a straightforward stack + nrow and the computing the ratio of n/ntotal. Thank you all. This is what I was looking for