Balancing groups in DataFrame

Quick question: suppose I have a DataFrame of the form

df = DataFrame(grouping1 = [1, 1, 2, 2], grouping2 = [true, true, true, false])

Captura de pantalla 2024-08-16 a la(s) 20.07.20

Even though, for the grouping1 category: 1, there are no observations with grouping2 value: false, I would like to then count observations forcing a homogenous nesting across grouping1 values so that when I do something along the lines of:

df = combine(DataFrames.groupby(df, [:grouping1, :grouping2]), nrow)

Instead of getting this:
Captura de pantalla 2024-08-16 a la(s) 20.09.59
I get what I refer to a balanced df where fro grouping1 value 1, grouping2 has two values too and false would apear to have 0 counts.

This seems like useful when structuring data into nests and I thought there would be some function/option already available but I cant seem to find one.

1 Like

:person_shrugging:

julia> gdf = groupby(df, :grouping1);

julia> combine(
           gdf,
           :grouping2 => count => :trues,
           :grouping2 => (count ∘ .!) => :falses,
       )
2Γ—3 DataFrame
 Row β”‚ grouping1  trues  falses
     β”‚ Int64      Int64  Int64
─────┼──────────────────────────
   1 β”‚         1      2       0
   2 β”‚         2      1       1

I would just create a new data frame that has all the combinations present. You can use the function allcombinations for this

julia> df = DataFrame(grouping1 = [1, 1, 2, 2], grouping2 = [true, true, true, false])
4Γ—2 DataFrame
 Row β”‚ grouping1  grouping2 
     β”‚ Int64      Bool      
─────┼──────────────────────
   1 β”‚         1       true
   2 β”‚         1       true
   3 β”‚         2       true
   4 β”‚         2      false

julia> df_complete = allcombinations(DataFrame, grouping1 = [1, 2], grouping2 = [true, false])
4Γ—2 DataFrame
 Row β”‚ grouping1  grouping2 
     β”‚ Int64      Bool      
─────┼──────────────────────
   1 β”‚         1       true
   2 β”‚         2       true
   3 β”‚         1      false
   4 β”‚         2      false

julia> df_collapsed = @chain df begin
           groupby([:grouping1, :grouping2])
           combine(nrow)
           leftjoin(df_complete, _, on = [:grouping1, :grouping2])
           @transform :nrow = replace(:nrow, missing => 0)
       end
4Γ—3 DataFrame
 Row β”‚ grouping1  grouping2  nrow  
     β”‚ Int64      Bool       Int64 
─────┼─────────────────────────────
   1 β”‚         1       true      2
   2 β”‚         2       true      1
   3 β”‚         2      false      1
   4 β”‚         1      false      0
1 Like

If you are willing to live with missing instead of zero counts you can achieve this with

julia> unstack(
           combine(
                   groupby(
                         df,
                        [:grouping1, :grouping2],
                   ),
                   nrow => :n,
           ),
           :grouping2,
           :n,
    )
2Γ—3 DataFrame
 Row β”‚ grouping1  true    false   
     β”‚ Int64      Int64?  Int64?  
─────┼────────────────────────────
   1 β”‚         1       2  missing 
   2 β”‚         2       1        1

I guess to get the result the OP wants you would need to stack after unstacking.

It’s a bit more compact to use column numbers instead of names in this case.

julia> stack(unstack(combine(groupby(df, 1:2), nrow => :n), 2, :n), 2:3)
4Γ—3 DataFrame
 Row β”‚ grouping1  variable  value   
     β”‚ Int64      String    Int64?  
─────┼──────────────────────────────
   1 β”‚         1  true            2
   2 β”‚         2  true            1
   3 β”‚         1  false     missing 
   4 β”‚         2  false           1
1 Like

Another option, which might be simpler for non-experts of the DataFrames’ mini-language:

df_all = allcombinations(DataFrame, grouping1=[1, 2], grouping2=[true, false])
df_all.count = [count(values(c)==values(r) for r in eachrow(df)) for c in eachrow(df_all)]
df_all