How to get the first row of each group of a DataFrame?

How do I get the first row from each group of a grouped DataFrame?

Say, I want to find the sex and height of the tallest person in each age-group:

using DataFrames
df = DataFrame(sex = repeat(["F","M"], inner=4, outer=2), age = repeat(14:2:20,outer=4), height=100 .+100 .*rand(16))
16×3 DataFrame
│ Row │ sex    │ age   │ height  │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ F      │ 14    │ 127.878 │
│ 2   │ F      │ 16    │ 133.063 │
│ 3   │ F      │ 18    │ 186.6   │
│ 4   │ F      │ 20    │ 133.589 │
│ 5   │ M      │ 14    │ 186.655 │
│ 6   │ M      │ 16    │ 144.436 │
│ 7   │ M      │ 18    │ 155.293 │
│ 8   │ M      │ 20    │ 181.719 │
│ 9   │ F      │ 14    │ 199.308 │
│ 10  │ F      │ 16    │ 114.783 │
│ 11  │ F      │ 18    │ 111.551 │
│ 12  │ F      │ 20    │ 121.64  │
│ 13  │ M      │ 14    │ 177.85  │
│ 14  │ M      │ 16    │ 170.868 │
│ 15  │ M      │ 18    │ 126.169 │
│ 16  │ M      │ 20    │ 125.229 │

I tried this:

sort!(df, :height, rev=true)
first(groupby(df, :age))
4×3 SubDataFrame
│ Row │ sex    │ age   │ height  │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ F      │ 14    │ 199.308 │
│ 2   │ M      │ 14    │ 186.655 │
│ 3   │ M      │ 14    │ 177.85  │
│ 4   │ F      │ 14    │ 127.878 │

But this gives me the first group, not the first row from each group.

Then my instinct would be to "dot the first", but then I get:

first.(groupby(df, :age))
ERROR: ArgumentError: broadcasting over `GroupedDataFrame`s is reserved

In R I would do:

df <- data.frame(sex = rep(c("F","M"),2, each =4), age=rep(seq(14,20,2),4), height = 100 + 100*runif(16))
df <- plyr::arrange(df, plyr::desc(height))
plyr::ddply(df, ~ age, head, 1)
  sex age   height
1   M  14 193.4705
2   F  16 199.1906
3   M  18 165.1674
4   F  20 177.7445

What is the idiomatic way in DataFrames?

1 Like

Here are three options to get it:

julia> combine(groupby(df, :age)) do sdf
       sdf[argmax(sdf.height), :]
       end
4×3 DataFrame
│ Row │ age   │ sex    │ height  │
│     │ Int64 │ String │ Float64 │
├─────┼───────┼────────┼─────────┤
│ 1   │ 14    │ M      │ 179.193 │
│ 2   │ 16    │ M      │ 188.043 │
│ 3   │ 18    │ F      │ 181.994 │
│ 4   │ 20    │ M      │ 171.729 │

julia> combine(groupby(df, :age)) do sdf
       first(sort(sdf, :height, rev=true))
       end
4×3 DataFrame
│ Row │ age   │ sex    │ height  │
│     │ Int64 │ String │ Float64 │
├─────┼───────┼────────┼─────────┤
│ 1   │ 14    │ M      │ 179.193 │
│ 2   │ 16    │ M      │ 188.043 │
│ 3   │ 18    │ F      │ 181.994 │
│ 4   │ 20    │ M      │ 171.729 │

julia> combine(first, groupby(sort(df, :height, rev=true), :age))
4×3 DataFrame
│ Row │ age   │ sex    │ height  │
│     │ Int64 │ String │ Float64 │
├─────┼───────┼────────┼─────────┤
│ 1   │ 16    │ M      │ 188.043 │
│ 2   │ 18    │ F      │ 181.994 │
│ 3   │ 14    │ M      │ 179.193 │
│ 4   │ 20    │ M      │ 171.729 │

(the first one is fastest as it does not require sorting)

2 Likes

Wow that was quick!

Thank you so much @bkamins.
And thank you for DataFrames in general :smile: This is a great package :thumbsup:

The last solution is closest to my expectation:

combine(first, groupby(df, :age))
4×3 DataFrame
│ Row │ age   │ sex    │ height  │
│     │ Int64 │ String │ Float64 │
├─────┼───────┼────────┼─────────┤
│ 1   │ 14    │ F      │ 199.308 │
│ 2   │ 18    │ F      │ 186.6   │
│ 3   │ 20    │ M      │ 181.719 │
│ 4   │ 16    │ M      │ 170.868 │

But the first two preserve the order of the age groups.

Thanks again!

1 Like

Exactly. Note that you can also pass sort to groupby if you want groups to be sorted.

1 Like