How to get the first row of each group of a DataFrame?

How do I get the first row from each group of a grouped DataFrame?

Say, I want to find the sex and height of the tallest person in each age-group:

using DataFrames
df = DataFrame(sex = repeat(["F","M"], inner=4, outer=2), age = repeat(14:2:20,outer=4), height=100 .+100 .*rand(16))
16×3 DataFrame
│ Row │ sex    │ age   │ height  │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ F      │ 14    │ 127.878 │
│ 2   │ F      │ 16    │ 133.063 │
│ 3   │ F      │ 18    │ 186.6   │
│ 4   │ F      │ 20    │ 133.589 │
│ 5   │ M      │ 14    │ 186.655 │
│ 6   │ M      │ 16    │ 144.436 │
│ 7   │ M      │ 18    │ 155.293 │
│ 8   │ M      │ 20    │ 181.719 │
│ 9   │ F      │ 14    │ 199.308 │
│ 10  │ F      │ 16    │ 114.783 │
│ 11  │ F      │ 18    │ 111.551 │
│ 12  │ F      │ 20    │ 121.64  │
│ 13  │ M      │ 14    │ 177.85  │
│ 14  │ M      │ 16    │ 170.868 │
│ 15  │ M      │ 18    │ 126.169 │
│ 16  │ M      │ 20    │ 125.229 │

I tried this:

sort!(df, :height, rev=true)
first(groupby(df, :age))
4×3 SubDataFrame
│ Row │ sex    │ age   │ height  │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ F      │ 14    │ 199.308 │
│ 2   │ M      │ 14    │ 186.655 │
│ 3   │ M      │ 14    │ 177.85  │
│ 4   │ F      │ 14    │ 127.878 │

But this gives me the first group, not the first row from each group.

Then my instinct would be to “dot the first”, but then I get:

first.(groupby(df, :age))
ERROR: ArgumentError: broadcasting over `GroupedDataFrame`s is reserved

In R I would do:

df <- data.frame(sex = rep(c("F","M"),2, each =4), age=rep(seq(14,20,2),4), height = 100 + 100*runif(16))
df <- plyr::arrange(df, plyr::desc(height))
plyr::ddply(df, ~ age, head, 1)
  sex age   height
1   M  14 193.4705
2   F  16 199.1906
3   M  18 165.1674
4   F  20 177.7445

What is the idiomatic way in DataFrames?

1 Like

Here are three options to get it:

julia> combine(groupby(df, :age)) do sdf
       sdf[argmax(sdf.height), :]
       end
4×3 DataFrame
│ Row │ age   │ sex    │ height  │
│     │ Int64 │ String │ Float64 │
├─────┼───────┼────────┼─────────┤
│ 1   │ 14    │ M      │ 179.193 │
│ 2   │ 16    │ M      │ 188.043 │
│ 3   │ 18    │ F      │ 181.994 │
│ 4   │ 20    │ M      │ 171.729 │

julia> combine(groupby(df, :age)) do sdf
       first(sort(sdf, :height, rev=true))
       end
4×3 DataFrame
│ Row │ age   │ sex    │ height  │
│     │ Int64 │ String │ Float64 │
├─────┼───────┼────────┼─────────┤
│ 1   │ 14    │ M      │ 179.193 │
│ 2   │ 16    │ M      │ 188.043 │
│ 3   │ 18    │ F      │ 181.994 │
│ 4   │ 20    │ M      │ 171.729 │

julia> combine(first, groupby(sort(df, :height, rev=true), :age))
4×3 DataFrame
│ Row │ age   │ sex    │ height  │
│     │ Int64 │ String │ Float64 │
├─────┼───────┼────────┼─────────┤
│ 1   │ 16    │ M      │ 188.043 │
│ 2   │ 18    │ F      │ 181.994 │
│ 3   │ 14    │ M      │ 179.193 │
│ 4   │ 20    │ M      │ 171.729 │

(the first one is fastest as it does not require sorting)

4 Likes

Wow that was quick!

Thank you so much @bkamins.
And thank you for DataFrames in general :smile: This is a great package :thumbsup:

The last solution is closest to my expectation:

combine(first, groupby(df, :age))
4×3 DataFrame
│ Row │ age   │ sex    │ height  │
│     │ Int64 │ String │ Float64 │
├─────┼───────┼────────┼─────────┤
│ 1   │ 14    │ F      │ 199.308 │
│ 2   │ 18    │ F      │ 186.6   │
│ 3   │ 20    │ M      │ 181.719 │
│ 4   │ 16    │ M      │ 170.868 │

But the first two preserve the order of the age groups.

Thanks again!

1 Like

Exactly. Note that you can also pass sort to groupby if you want groups to be sorted.

1 Like

Is there a way to do the last one in DataFramesMeta using groupby and @combine ?

You mean this one?

@chain df begin
    groupby :age
    @combine $first
end

if the constraint on the use of combine is removed, a new (among others) possibility

gdf=groupby(df,:age)
subset(gdf, :height => x->x.== maximum(x))

PS
The “advantage” of this solution compared to the others with the use of combine is the management of the presence of any “homozygous twins” (therefore having the same height for genetic reasons). :grinning: