DataFrames: Most efficient way to compute statistics on multiple/nested subgroups

told · August 20, 2021, 8:00am

I have a dataset with population statistics similar to this:

using DataFrames
#make a dummy data set
df = DataFrame(date = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2], 
               age_group = [1, 2, 2, 3, 4, 4, 5, 1, 2, 3, 4, 5, 5, 5, 5], 
               count = [10, 10, 20, 20, 10, 10, 30, 10, 20, 30, 10, 20, 30, 40, 50])

15×3 DataFrame
 Row │ date   age_group  count 
     │ Int64  Int64      Int64 
─────┼─────────────────────────
   1 │     1          1     10
   2 │     1          2     10
   3 │     1          2     20
   4 │     1          3     20
  ⋮  │   ⋮        ⋮        ⋮
  13 │     2          5     30
  14 │     2          5     40

So multiple count entries per date and age group.
My goal is to calculate the total percentage of each age group from the total population for each day.

Currently I’m doing:

#calculate statistics on the two groups
date_stats =  combine(groupby(df,["date"]), df -> DataFrame(date_sum = sum(df.count))) #total count per day
age_group_stats =  combine(groupby(df,["date","age_group"]), df -> DataFrame(age_group_sum = sum(df.count))) #agegroup count per day

# join and add/calculate percentage
df = innerjoin(date_stats, age_group_stats, on = :date)
df[:, :age_percentage] = df.age_group_sum./df.date_sum

Giving me:

10×5 DataFrame
 Row │ date   date_sum  age_group  age_group_sum  age_percentage 
     │ Int64  Int64     Int64      Int64          Float64        
─────┼───────────────────────────────────────────────────────────
   1 │     1       110          1             10       0.0909091
   2 │     1       110          2             30       0.272727
   3 │     1       110          3             20       0.181818
   4 │     1       110          4             20       0.181818
  ⋮  │   ⋮       ⋮          ⋮            ⋮              ⋮
   8 │     2       210          3             30       0.142857
   9 │     2       210          4             10       0.047619
  10 │     2       210          5            140       0.666667
                                                   3 rows omitted

Since I’m not very familiar with DataFrames.jl I’m sure there is a much neater way to do this with less allocations (I don’t need the age_group_sum afterwards), less lines (maybe one single groupby call?). Any tips are highly appreciated.

elbersb · August 20, 2021, 11:41am

How about:

agg = combine(groupby(df, [:date, :age_group]), :count => sum => :count) 
transform(groupby(agg, :date), :count => (x -> x / sum(x)) => :age_percentage)

pdeffebach · August 20, 2021, 1:33pm

DataFramesMeta.jl is also very useful for things.

I’m not following your problem exactly, but the @by macro does the grouping and transformation in one step.

You can also use Chain.jl (exported by DataFramesMeta) to avoid having to name multiple output data frames.

Topic		Replies	Views
Easier way to split-apply-combine in DataFrames.jl General Usage dataframes	5	1111	December 14, 2020
Calculation question New to Julia question , dataframes	4	603	November 29, 2021
Compute frequency or proportions on grouped dataframes Data dataframes	10	1165	June 14, 2021
Efficient computation of statistics across multiple data frames Data dataframes	9	351	January 31, 2024
DataFrames.jl: keep group by result in every row General Usage dataframes	4	1685	February 9, 2020

DataFrames: Most efficient way to compute statistics on multiple/nested subgroups

Related topics