How to select the top 5 results per group in a DataFrame?

I’m looking for a Julian way of selecting a subset of each group. I have a DataFrame with (among others) two columns, say name and length. I want to group over all names and pick the 5 tallest people within each name. I tried this, but it does not return the correct result:

df = ...
sort!(df, [:length])
df2 = df |> @groupby([:name]) |> @take(5) |> collect
print(DataFrame(df2))

Changing collect to DataFrame does not work either: the print tells me I have a DataFrame with as many rows as the initial df. This kind of operation (taking a DataFrame, grouping it, selecting a subset of the rows in each group, and recombining the selected rows into a DataFrame) is something I would assume is common.

Using just DataFrames and Pipe, you can write:

using DataFrames, Pipe
@pipe df |>
    groupby(_, :name) |>
    combine(_) do sdf
        sorted = sort(sdf, :length, rev=true)  # sort this group (sdf, not df); rev=true puts the tallest first
        first(sorted, 5)
    end
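
For anyone who wants to try this on a small table, here is a minimal self-contained sketch of the same group, sort, take, recombine pattern without the Pipe dependency. The sample data and the helper name top_n_per_group are made up for illustration; only the column names come from the question.

```julia
using DataFrames

# Hypothetical sample data; the column names follow the question.
df = DataFrame(
    name   = ["Ann", "Ann", "Ann", "Bob", "Bob", "Bob", "Bob"],
    length = [150, 180, 165, 170, 190, 160, 175],
)

# Illustrative helper (not part of DataFrames): keep the n rows with the
# largest `length` within each `name` group.
function top_n_per_group(df::AbstractDataFrame, n::Integer)
    combine(groupby(df, :name)) do sdf
        # `sdf` is the sub-DataFrame for one group; sort it, not `df`.
        first(sort(sdf, :length, rev=true), n)
    end
end

top2 = top_n_per_group(df, 2)
println(top2)
```

If a group has fewer than n rows, first simply returns all of them, so small groups come through unchanged.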

Thanks! It feels like the Julian way of doing things is to know which library to use :)

Do you mean to use “df” or “sdf” on line 5?

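Sorry, I meant sdf. Good catch: sorting df there would sort the whole table inside every group instead of the rows of that group.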