DataFrames: Most efficient way to compute statistics on multiple/nested subgroups

I have a dataset with population statistics similar to this:

using DataFrames
#make a dummy data set
df = DataFrame(date = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2], 
               age_group = [1, 2, 2, 3, 4, 4, 5, 1, 2, 3, 4, 5, 5, 5, 5], 
               count = [10, 10, 20, 20, 10, 10, 30, 10, 20, 30, 10, 20, 30, 40, 50])
15×3 DataFrame
 Row │ date   age_group  count 
     │ Int64  Int64      Int64 
─────┼─────────────────────────
   1 │     1          1     10
   2 │     1          2     10
   3 │     1          2     20
   4 │     1          3     20
  ⋮  │   ⋮        ⋮        ⋮
  13 │     2          5     30
  14 │     2          5     40

So multiple count entries per date and age group.
My goal is to calculate the total percentage of each age group from the total population for each day.

Currently I’m doing:

#calculate statistics on the two groups
date_stats =  combine(groupby(df,["date"]), df -> DataFrame(date_sum = sum(df.count))) #total count per day
age_group_stats =  combine(groupby(df,["date","age_group"]), df -> DataFrame(age_group_sum = sum(df.count))) #agegroup count per day

# join and add/calculate percentage
df = innerjoin(date_stats, age_group_stats, on = :date)
df[:, :age_percentage] = df.age_group_sum./df.date_sum

Giving me:

10×5 DataFrame
 Row │ date   date_sum  age_group  age_group_sum  age_percentage 
     │ Int64  Int64     Int64      Int64          Float64        
─────┼───────────────────────────────────────────────────────────
   1 │     1       110          1             10       0.0909091
   2 │     1       110          2             30       0.272727
   3 │     1       110          3             20       0.181818
   4 │     1       110          4             20       0.181818
  ⋮  │   ⋮       ⋮          ⋮            ⋮              ⋮
   8 │     2       210          3             30       0.142857
   9 │     2       210          4             10       0.047619
  10 │     2       210          5            140       0.666667
                                                   3 rows omitted

Since I’m not very familiar with DataFrames.jl I’m sure there is a much neater way to do this with less allocations (I don’t need the age_group_sum afterwards), less lines (maybe one single groupby call?). Any tips are highly appreciated.

How about:

agg = combine(groupby(df, [:date, :age_group]), :count => sum => :count) 
transform(groupby(agg, :date), :count => (x -> x / sum(x)) => :age_percentage)

DataFramesMeta.jl is also very useful for things.

I’m not following your problem exactly, but the @by macro does the grouping and transformation in one step.

You can also use Chain.jl (exported by DataFramesMeta) to avoid having to name multiple output data frames.