DataFrame group by first column, and sort by last column

mbeach42 · October 15, 2018, 6:04pm

Hi,

Probably a simple question, but with DataFrames.jl I’m having trouble returning the data frame grouped by the first column, the sorted by the last column. Here’s a MWE

 using DataFrames
  A = rand(5,5)
  A[:,1] = [1, 1, 2, 2, 4]
  df = DataFrame(A)                                                                                                  
  df2 = by(df, :x1, df -> minimum(df.x2))

Output is:

julia> df
5×5 DataFrame
│ Row │ x1      │ x2       │ x3        │ x4       │ x5       │
│     │ Float64 │ Float64  │ Float64   │ Float64  │ Float64  │
├─────┼─────────┼──────────┼───────────┼──────────┼──────────┤
│ 1   │ 1.0     │ 0.684006 │ 0.27617   │ 0.453495 │ 0.701109 │
│ 2   │ 1.0     │ 0.282817 │ 0.94968   │ 0.531294 │ 0.262201 │
│ 3   │ 2.0     │ 0.802795 │ 0.0950535 │ 0.538556 │ 0.155517 │
│ 4   │ 2.0     │ 0.986956 │ 0.609984  │ 0.633382 │ 0.169541 │
│ 5   │ 4.0     │ 0.975459 │ 0.876686  │ 0.105175 │ 0.221114 │

julia> df2
3×2 DataFrame
│ Row │ x1      │ x1_1     │
│     │ Float64 │ Float64  │
├─────┼─────────┼──────────┤
│ 1   │ 1.0     │ 0.282817 │
│ 2   │ 2.0     │ 0.802795 │
│ 3   │ 4.0     │ 0.975459 │

So in the above, df2 is a sorted with just the first and last column, but I’d like to return all columns. Anyone know how to go about doing that?

Thanks!

davidanthoff · October 15, 2018, 7:32pm

So you are ok to have only one row per group, right? What values would you want to see for the other columns? Also use the minimum aggregation per group, or something else?

mbeach42 · October 15, 2018, 7:52pm

that’s right, one value per group. I dont want the minimum of each column, but the row corresponding to the minimum of the right-most column for each group

pdeffebach · October 15, 2018, 8:43pm

I think this should work:

by(df, :x1) do d
   DataFrame([minimum(d[name]) for name in names(d)]', names(d)) 
   # for every group, get a row vector of the minimums for that group
  # make a DataFrame out of it, (which is kinda expensive), and give 
  # it the names of the sub-dataframe you are acting on. 
end

davidanthoff · October 16, 2018, 12:26am

The Query.jl way would be this:

df |> @groupby(_.x1) |> @map(first(sort(_, by=i->i.x2))) |> DataFrame

If we get https://github.com/JuliaLang/julia/issues/28210 into Base eventually, it would simplify to:

df |> @groupby(_.x1) |> @map(minimum(_, by=i->i.x2)) |> DataFrame

Here is how this works: the first @groupby creates a stream of three groups, where each group is a table itself. A table is just an array of rows (named tuples). So calling sort on that will sort the rows, and then first will pick the first row from each group.

Topic		Replies	Views
How to get the first row of each group of a DataFrame? New to Julia	7	4537	December 20, 2022
Create grouped dataframe by properties of a given column? New to Julia dataframes , grouped-data	9	450	April 26, 2024
How to select top 5 results per group in dataframe? New to Julia	3	1204	September 23, 2020
Sorting a Table / DataFrame General Usage dataframes	12	888	September 6, 2023
Sort rows in a dataframe based on a predefined order New to Julia sort , dataframes	5	1905	September 17, 2021

DataFrame group by first column, and sort by last column

Related topics