I have a dataset with population statistics similar to this:
using DataFrames
#make a dummy data set
df = DataFrame(date = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
age_group = [1, 2, 2, 3, 4, 4, 5, 1, 2, 3, 4, 5, 5, 5, 5],
count = [10, 10, 20, 20, 10, 10, 30, 10, 20, 30, 10, 20, 30, 40, 50])
15×3 DataFrame
Row │ date age_group count
│ Int64 Int64 Int64
─────┼─────────────────────────
1 │ 1 1 10
2 │ 1 2 10
3 │ 1 2 20
4 │ 1 3 20
⋮ │ ⋮ ⋮ ⋮
13 │ 2 5 30
14 │ 2 5 40
So multiple count entries per date and age group.
My goal is to calculate the total percentage of each age group from the total population for each day.
Currently I’m doing:
#calculate statistics on the two groups
date_stats = combine(groupby(df,["date"]), df -> DataFrame(date_sum = sum(df.count))) #total count per day
age_group_stats = combine(groupby(df,["date","age_group"]), df -> DataFrame(age_group_sum = sum(df.count))) #agegroup count per day
# join and add/calculate percentage
df = innerjoin(date_stats, age_group_stats, on = :date)
df[:, :age_percentage] = df.age_group_sum./df.date_sum
Giving me:
10×5 DataFrame
Row │ date date_sum age_group age_group_sum age_percentage
│ Int64 Int64 Int64 Int64 Float64
─────┼───────────────────────────────────────────────────────────
1 │ 1 110 1 10 0.0909091
2 │ 1 110 2 30 0.272727
3 │ 1 110 3 20 0.181818
4 │ 1 110 4 20 0.181818
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
8 │ 2 210 3 30 0.142857
9 │ 2 210 4 10 0.047619
10 │ 2 210 5 140 0.666667
3 rows omitted
Since I’m not very familiar with DataFrames.jl I’m sure there is a much neater way to do this with less allocations (I don’t need the age_group_sum
afterwards), less lines (maybe one single groupby
call?). Any tips are highly appreciated.