How do I get the first row from each group of a grouped DataFrame?
Say, I want to find the sex and height of the tallest person in each age-group:
using DataFrames
df = DataFrame(sex = repeat(["F","M"], inner=4, outer=2), age = repeat(14:2:20,outer=4), height=100 .+100 .*rand(16))
16×3 DataFrame
│ Row │ sex │ age │ height │
│ │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1 │ F │ 14 │ 127.878 │
│ 2 │ F │ 16 │ 133.063 │
│ 3 │ F │ 18 │ 186.6 │
│ 4 │ F │ 20 │ 133.589 │
│ 5 │ M │ 14 │ 186.655 │
│ 6 │ M │ 16 │ 144.436 │
│ 7 │ M │ 18 │ 155.293 │
│ 8 │ M │ 20 │ 181.719 │
│ 9 │ F │ 14 │ 199.308 │
│ 10 │ F │ 16 │ 114.783 │
│ 11 │ F │ 18 │ 111.551 │
│ 12 │ F │ 20 │ 121.64 │
│ 13 │ M │ 14 │ 177.85 │
│ 14 │ M │ 16 │ 170.868 │
│ 15 │ M │ 18 │ 126.169 │
│ 16 │ M │ 20 │ 125.229 │
I tried this:
sort!(df, :height, rev=true)
first(groupby(df, :age))
4×3 SubDataFrame
│ Row │ sex │ age │ height │
│ │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1 │ F │ 14 │ 199.308 │
│ 2 │ M │ 14 │ 186.655 │
│ 3 │ M │ 14 │ 177.85 │
│ 4 │ F │ 14 │ 127.878 │
But this gives me the first group, not the first row from each group.
Then my instinct would be to “dot the first
”, but then I get:
first.(groupby(df, :age))
ERROR: ArgumentError: broadcasting over `GroupedDataFrame`s is reserved
In R I would do:
df <- data.frame(sex = rep(c("F","M"),2, each =4), age=rep(seq(14,20,2),4), height = 100 + 100*runif(16))
df <- plyr::arrange(df, plyr::desc(height))
plyr::ddply(df, ~ age, head, 1)
sex age height
1 M 14 193.4705
2 F 16 199.1906
3 M 18 165.1674
4 F 20 177.7445
What is the idiomatic way in DataFrames?