How to select top 5 results per group in dataframe?

conditionality · September 23, 2020, 6:49pm

I’m looking for a Julian way of selecting a subset of each group. I have a DataFrame with (among others) two columns, say name and length. I want to group over all names and pick the 5 tallest people within each name. I tried this, but it does not return the correct result:

df = ...
sort!(df, [:length])
df2 = df |> @groupby([:name]) |> @take(5) |> collect
print(DataFrame(df2))

Changing collect to DataFrame does not work either. The print tells me I have a dataframe with as many rows as the initial dataframe df. This sort of thing (taking a df → grouping it → selecting a subset of the rows of each group → recombining the selected rows into a dataframe) is something I would assume is a common thing to do.

pdeffebach · September 23, 2020, 7:19pm

Using just DataFrames and Pipe, you have:

using DataFrames, Pipe
@pipe df |>
    groupby(_, :name) |>
    combine(_) do sdf
        sorted = sort(df, :length)
        first(sorted, 5)
    end

conditionality · September 23, 2020, 7:42pm

Thanks! It feels like the Julian way of doing things is to know which library to use.

Do you mean to use “df” or “sdf” on line 5?

pdeffebach · September 23, 2020, 7:48pm

Sorry, I meant sdf. Good catch.
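
For reference, here is a sketch of the snippet with that fix applied. The sample data and the variable `n` are made up for illustration. `rev=true` is an addition, on the assumption that "tallest" means the largest `length` values; an ascending sort followed by `first` would return the smallest ones instead. Pipe is left out here only to keep the example to a single dependency.

```julia
using DataFrames

# Hypothetical sample data; column names follow the question.
df = DataFrame(
    name   = ["a", "a", "a", "b", "b"],
    length = [3, 1, 2, 10, 20],
)

n = 2  # the thread asks for 5; 2 keeps the toy example small

# For each group, sort the sub-dataframe (sdf, not df) by length in
# descending order and keep the first n rows. combine then stacks the
# per-group results back into one DataFrame.
top = combine(groupby(df, :name)) do sdf
    first(sort(sdf, :length, rev=true), n)
end

println(top)
```

If a group has fewer than `n` rows, `first` simply returns all of them.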
