DataFrame group by first column, and sort by last column

Hi,

Probably a simple question, but with DataFrames.jl I’m having trouble returning the data frame grouped by the first column, the sorted by the last column. Here’s a MWE

 using DataFrames
  A = rand(5,5)
  A[:,1] = [1, 1, 2, 2, 4]
  df = DataFrame(A)                                                                                                  
  df2 = by(df, :x1, df -> minimum(df.x2))

Output is:

julia> df
5×5 DataFrame
│ Row │ x1      │ x2       │ x3        │ x4       │ x5       │
│     │ Float64 │ Float64  │ Float64   │ Float64  │ Float64  │
├─────┼─────────┼──────────┼───────────┼──────────┼──────────┤
│ 1   │ 1.0     │ 0.684006 │ 0.27617   │ 0.453495 │ 0.701109 │
│ 2   │ 1.0     │ 0.282817 │ 0.94968   │ 0.531294 │ 0.262201 │
│ 3   │ 2.0     │ 0.802795 │ 0.0950535 │ 0.538556 │ 0.155517 │
│ 4   │ 2.0     │ 0.986956 │ 0.609984  │ 0.633382 │ 0.169541 │
│ 5   │ 4.0     │ 0.975459 │ 0.876686  │ 0.105175 │ 0.221114 │

julia> df2
3×2 DataFrame
│ Row │ x1      │ x1_1     │
│     │ Float64 │ Float64  │
├─────┼─────────┼──────────┤
│ 1   │ 1.0     │ 0.282817 │
│ 2   │ 2.0     │ 0.802795 │
│ 3   │ 4.0     │ 0.975459 │

So in the above, df2 is a sorted with just the first and last column, but I’d like to return all columns. Anyone know how to go about doing that?

Thanks!

So you are ok to have only one row per group, right? What values would you want to see for the other columns? Also use the minimum aggregation per group, or something else?

that’s right, one value per group. I dont want the minimum of each column, but the row corresponding to the minimum of the right-most column for each group

I think this should work:

by(df, :x1) do d
   DataFrame([minimum(d[name]) for name in names(d)]', names(d)) 
   # for every group, get a row vector of the minimums for that group
  # make a DataFrame out of it, (which is kinda expensive), and give 
  # it the names of the sub-dataframe you are acting on. 
end

The Query.jl way would be this:

df |> @groupby(_.x1) |> @map(first(sort(_, by=i->i.x2))) |> DataFrame

If we get https://github.com/JuliaLang/julia/issues/28210 into Base eventually, it would simplify to:

df |> @groupby(_.x1) |> @map(minimum(_, by=i->i.x2)) |> DataFrame

Here is how this works: the first @groupby creates a stream of three groups, where each group is a table itself. A table is just an array of rows (named tuples). So calling sort on that will sort the rows, and then first will pick the first row from each group.