# DataFrames: Most efficient way to compute statistics on multiple/nested subgroups

I have a dataset with population statistics similar to this:

``````using DataFrames
#make a dummy data set
df = DataFrame(date = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
age_group = [1, 2, 2, 3, 4, 4, 5, 1, 2, 3, 4, 5, 5, 5, 5],
count = [10, 10, 20, 20, 10, 10, 30, 10, 20, 30, 10, 20, 30, 40, 50])
``````
``````15×3 DataFrame
Row │ date   age_group  count
│ Int64  Int64      Int64
─────┼─────────────────────────
1 │     1          1     10
2 │     1          2     10
3 │     1          2     20
4 │     1          3     20
⋮  │   ⋮        ⋮        ⋮
13 │     2          5     30
14 │     2          5     40
``````

So multiple count entries per date and age group.
My goal is to calculate the total percentage of each age group from the total population for each day.

Currently I’m doing:

``````#calculate statistics on the two groups
date_stats =  combine(groupby(df,["date"]), df -> DataFrame(date_sum = sum(df.count))) #total count per day
age_group_stats =  combine(groupby(df,["date","age_group"]), df -> DataFrame(age_group_sum = sum(df.count))) #agegroup count per day

df = innerjoin(date_stats, age_group_stats, on = :date)
df[:, :age_percentage] = df.age_group_sum./df.date_sum
``````

Giving me:

``````10×5 DataFrame
Row │ date   date_sum  age_group  age_group_sum  age_percentage
│ Int64  Int64     Int64      Int64          Float64
─────┼───────────────────────────────────────────────────────────
1 │     1       110          1             10       0.0909091
2 │     1       110          2             30       0.272727
3 │     1       110          3             20       0.181818
4 │     1       110          4             20       0.181818
⋮  │   ⋮       ⋮          ⋮            ⋮              ⋮
8 │     2       210          3             30       0.142857
9 │     2       210          4             10       0.047619
10 │     2       210          5            140       0.666667
3 rows omitted
``````

Since I’m not very familiar with DataFrames.jl I’m sure there is a much neater way to do this with less allocations (I don’t need the` age_group_sum` afterwards), less lines (maybe one single` groupby` call?). Any tips are highly appreciated.

``````agg = combine(groupby(df, [:date, :age_group]), :count => sum => :count)
I’m not following your problem exactly, but the `@by` macro does the grouping and transformation in one step.
You can also use `Chain.jl` (exported by DataFramesMeta) to avoid having to name multiple output data frames.