Compute frequency or proportions on grouped dataframes

vjd · June 13, 2021, 1:33pm

I would like to learn all possible ways of summarizing categorical data using the dataframes ecosystem.
Preferably, I would like to compute 1) the frequencies and 2) the proportions of each categorical variable after a grouping operation (ideally in a chained operation). Here is an example we can use. The result of countmap is not really presentable, but I am sure there are ways of converting it into a meaningful dataframe.

julia> dd = DataFrame(a = ["a", "b","a", "b"], b = ["no", "yes", "no", "no"], c = ["lo", "lo", "hi", "hi"])
4×3 DataFrame
 Row │ a       b       c      
     │ String  String  String 
─────┼────────────────────────
   1 │ a       no      lo
   2 │ b       yes     lo
   3 │ a       no      hi
   4 │ b       no      hi

julia> cat_summary = @chain dd begin
           groupby(_, [:a])
           combine(_, vec([:b,:c] .=> countmap))
       end
2×3 DataFrame
 Row │ a       b_countmap               c_countmap             
     │ String  Dict…                    Dict…                  
─────┼─────────────────────────────────────────────────────────
   1 │ a       Dict("no"=>2)            Dict("hi"=>1, "lo"=>1)
   2 │ b       Dict("yes"=>1, "no"=>1)  Dict("hi"=>1, "lo"=>1)

pdeffebach · June 13, 2021, 2:07pm

This is a tough question. I think the main problem is that if :b and :c have different numbers of categories, it’s hard to imagine a way to present this data as vectors of pairs rather than Dicts.

Do you have a particular output type in mind?

vjd · June 13, 2021, 2:09pm

Perhaps we can start with just b?

rocco_sprmnt21 · June 13, 2021, 3:53pm

It is not clear what you expect, but if you do not find anything better, try adapting a scheme like the following to your case:

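using DataFrames, StatsBase  # StatsBase provides countmap

dd = DataFrame(A = ["a", "b", "a", "b"], B = ["no", "yes", "no", "no"], C = ["low", "low", "hi", "hi"])
gdd = groupby(dd, :A)
# Templates with every category set to 0, so each group's Dict ends up with the same keys
dx = Dict("no" => 0, "yes" => 0)
dy = Dict("hi" => 0, "low" => 0)
comb = combine(gdd, [:B, :C] .=> countmap .=> [:Bb, :Cc])
tr = transform(comb, [:Bb, :Cc] => ByRow((x, y) -> [merge(dx, x), merge(dy, y)]) => [:Bb, :Cc])
# Expand each Dict column into one column per category
transform(tr, [:Bb, :Cc] .=> identity => AsTable)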

pdeffebach · June 13, 2021, 5:58pm

Here is something pretty good. Maybe someone can come up with something better, though.

julia> @chain dd begin 
           @aside v = unique(dd.b)
           groupby(:a)
           @combine b_countmap = begin 
               d = countmap(:b)
               for vi in v
                   get!(d, vi, 0)
               end
               d
           end
           flatten(:b_countmap)
           transform(:b_countmap => ByRow(b -> (b_value = first(b), b_count = last(b))) => AsTable)
           select(Not(:b_countmap))
       end
4×3 DataFrame
 Row │ a       b_value  b_count 
     │ String  String   Int64   
─────┼──────────────────────────
   1 │ a       yes            0
   2 │ a       no             2
   3 │ b       yes            1
   4 │ b       no             1

jules · June 13, 2021, 6:06pm

Isn’t that getting close to the usual approach?

@chain dd begin
    groupby([:a, :b])
    combine(nrow => :count)
end

pdeffebach · June 13, 2021, 6:08pm

Yeah it is lol. This is the correct answer.

jules · June 13, 2021, 6:10pm

I’ve had this mental twist before: I’m thinking about groups of a, and then counts of instances of b, but really it’s counts of groups of [a, b].
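
One difference worth noting: grouping on [a, b] only returns combinations that actually occur, so the zero-count row (a, yes) from the countmap version above does not appear. If those zero rows matter, a minimal sketch is to build every combination and join the counts onto it. The names counts, allcombos and full are made up for illustration, and this assumes DataFrames.jl is installed.

```julia
using DataFrames

dd = DataFrame(a = ["a", "b", "a", "b"],
               b = ["no", "yes", "no", "no"],
               c = ["lo", "lo", "hi", "hi"])

# Counts of the [a, b] combinations that are present in the data
counts = combine(groupby(dd, [:a, :b]), nrow => :count)

# Every combination of the observed levels of :a and :b
allcombos = crossjoin(unique(dd[:, [:a]]), unique(dd[:, [:b]]))

# Combinations with no match get `missing` from the join; turn those into 0
full = leftjoin(allcombos, counts, on = [:a, :b])
full.count = coalesce.(full.count, 0)
```

The row order after the join is not something to rely on, so sort the result if the order matters.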

bkamins · June 13, 2021, 6:13pm

Yes, and then:

@chain dd begin
    groupby([:a, :b])
    combine(nrow => :count)
    groupby(:a)
    combine(:count => (x -> x / sum(x)) => :prop)
end

to get proportions. At some point we will add proprow and rownumber (Pull Request #2556, JuliaData/DataFrames.jl on GitHub).
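
Note that combine in the second step returns only :a and :prop, so the :b labels and the counts are dropped. A possible variant, sketched here on the dd example from the first post and assuming DataFrames.jl and Chain.jl are installed, uses transform instead so that each proportion stays next to its label. The name props is just for illustration.

```julia
using DataFrames, Chain

dd = DataFrame(a = ["a", "b", "a", "b"],
               b = ["no", "yes", "no", "no"],
               c = ["lo", "lo", "hi", "hi"])

props = @chain dd begin
    groupby([:a, :b])
    combine(nrow => :count)          # one row per observed (a, b) pair
    groupby(:a)
    # transform (rather than combine) keeps :b and :count alongside :prop
    transform(:count => (x -> x ./ sum(x)) => :prop)
end
```

Here :prop is the share of each :b value within its :a group, so the proportions sum to 1 inside every group.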

jules · June 13, 2021, 6:41pm

And :b and :c at the same time can probably only really be handled by stacking them, because they don’t correspond to each other:

@chain dd begin
    stack([:b, :c])
    groupby([:a, :variable, :value])
    combine(nrow => :count)
end

7×4 DataFrame
 Row │ a       variable  value   count 
     │ String  String    String  Int64 
─────┼─────────────────────────────────
   1 │ a       b         no          2
   2 │ b       b         yes         1
   3 │ b       b         no          1
   4 │ a       c         lo          1
   5 │ b       c         lo          1
   6 │ a       c         hi          1
   7 │ b       c         hi          1

vjd · June 14, 2021, 12:20am

Indeed, it is a straightforward stack + nrow and then computing the ratio n/ntotal. Thank you all. This is what I was looking for.
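
For reference, the combined recipe might look like the sketch below: stack the categorical columns, count rows per (a, variable, value), then divide by the total within each (a, variable). This assumes DataFrames.jl and Chain.jl are installed; the name cat_props is made up for illustration.

```julia
using DataFrames, Chain

dd = DataFrame(a = ["a", "b", "a", "b"],
               b = ["no", "yes", "no", "no"],
               c = ["lo", "lo", "hi", "hi"])

cat_props = @chain dd begin
    stack([:b, :c])                        # long format: :a, :variable, :value
    groupby([:a, :variable, :value])
    combine(nrow => :count)                # frequency of each level
    groupby([:a, :variable])
    # proportion of each level within its (a, variable) group
    transform(:count => (x -> x ./ sum(x)) => :prop)
end
```

Grouping by both :a and :variable in the last step keeps the proportions of :b and :c separate, so each variable sums to 1 within each :a group.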
