Compute frequency or proportions on grouped dataframes

I would like to learn all possible ways of summarizing categorical data using the dataframes ecosystem.
Preferably, I would like to compute the 1) frequency 2) proportions of each categorical variable after a grouping operation (preferably in a chain ed operation). Here is an example we can use. The result of countmap is not really presentable, but I am sure there are ways of converting that into a meaningful dataframe.

julia> dd = DataFrame(a = ["a", "b","a", "b"], b = ["no", "yes", "no", "no"], c = ["lo", "lo", "hi", "hi"])
4Γ—3 DataFrame
 Row β”‚ a       b       c      
     β”‚ String  String  String 
─────┼────────────────────────
   1 β”‚ a       no      lo
   2 β”‚ b       yes     lo
   3 β”‚ a       no      hi
   4 β”‚ b       no      hi

julia> cat_summary = @chain dd begin
           groupby(_, [:a])
           combine(_, vec([:b,:c] .=> countmap))
       end
2Γ—3 DataFrame
 Row β”‚ a       b_countmap               c_countmap             
     β”‚ String  Dict…                    Dict…                  
─────┼─────────────────────────────────────────────────────────
   1 β”‚ a       Dict("no"=>2)            Dict("hi"=>1, "lo"=>1)
   2 β”‚ b       Dict("yes"=>1, "no"=>1)  Dict("hi"=>1, "lo"=>1)

This is a touch question. I think a main problem is that if :b and :c have different numbers of categories, it’s hard to imagine a way to present this data as vectors of pairs rather than Dicts.

Do you have a particular output type in mind?

perhaps we can start with just b ?

it is not clear what you expect, but if you do not find better, try to adapt a scheme of the following type to your case

d = DataFrame(A = ["a", "b","a", "b"], B = ["no", "yes", "no", "no"], C = ["low", "low", "hi", "hi"])
gdd=groupby(dd,:A)
dx=Dict("no"=>0,"yes"=>0)
dy=Dict("hi"=>0,"low"=>0)
comb=combine(gdd,[:B,:C].=>countmap.=>[:Bb,:Cc])
tr=transform(comb,[:Bb,:Cc]=>ByRow((x,y)->[merge(dx,x),merge(dy,y)])=>[:Bb,:Cc])
transform(tr,[:Bb,:Cc].=>identity=>AsTable)

Here is something pretty good. Maybe someone can come up with something better, though.

julia> @chain dd begin 
           @aside v = unique(dd.b)
           groupby(:a)
           @combine b_countmap = begin 
               d = countmap(:b)
               for vi in v
                   get!(d, vi, 0)
               end
               d
           end
           flatten(:b_countmap)
           transform(:b_countmap => ByRow(b -> (b_value = first(b), b_count = last(b))) => AsTable)
           select(Not(:b_countmap))
       end
4Γ—3 DataFrame
 Row β”‚ a       b_value  b_count 
     β”‚ String  String   Int64   
─────┼──────────────────────────
   1 β”‚ a       yes            0
   2 β”‚ a       no             2
   3 β”‚ b       yes            1
   4 β”‚ b       no             1

Isn’t that getting close to the normal

@chain df begin
    groupby([:a, :b])
    combine(nrow => :count)
end
2 Likes

Yeah it is lol. this is the correct answer.

I’ve had this mental twist before, I’m thinking about groups of a, and then counts of instances of b, but really it’s counts of groups of [a, b] :man_shrugging:

Yes, and then:

@chain df begin
    groupby([:a, :b])
    combine(nrow => :count)
    groupby(:a)
    combine(:count => (x -> x / sum(x)) => :prop)
end

to get proportions. At some point we will add add proprow and rownumber by bkamins Β· Pull Request #2556 Β· JuliaData/DataFrames.jl Β· GitHub.

2 Likes

And :b and :c at the same time can probably only really be handled by stacking them, because they don’t correspond to each other:

@chain dd begin
    stack([:b, :c])
    groupby([:a, :variable, :value])
    combine(nrow => :count)
end
7Γ—4 DataFrame
 Row β”‚ a       variable  value   count 
     β”‚ String  String    String  Int64 
─────┼─────────────────────────────────
   1 β”‚ a       b         no          2
   2 β”‚ b       b         yes         1
   3 β”‚ b       b         no          1
   4 β”‚ a       c         lo          1
   5 β”‚ b       c         lo          1
   6 β”‚ a       c         hi          1
   7 β”‚ b       c         hi          1
3 Likes

Indeed, it is a straightforward stack + nrow and the computing the ratio of n/ntotal. Thank you all. This is what I was looking for