Way to return multiple columns after applying combine-groupby transformation

I have the following df that i apply a transformation on. it only returns columns y and x however. is there a way to also return the corresponding values in columns a and b?

df=DataFrame(y=[1, 1, 2, 2],x=[3, 4, 5, 6],a=[1, 1, 1, 1],b=[2, 2, 2, 2])
combine(groupby(df,[:y]), [:x] .=> maximum; renamecols=false)

groupby(df, [:y, :a, :b])

sorry. i should have clarified: i dont want to groupby :a and :b. i only want the corresponding values from those columns when i groupby :y and then find the maximum value of :x for each unique value of :y

I don’t love this but it works

julia> combine(groupby(df, :y), 
    [:x, :a, :b] => ((x, a, b) -> let i = argmax(x); (;x=x[i], a=a[i], b=b[i]) end) => AsTable)
2Γ—4 DataFrame
 Row β”‚ y      x      a      b     
     β”‚ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 β”‚     1      4      1      2
   2 β”‚     2      6      1      2

Another option:

DataFrame([argmax(t->(t.x),eachrow(q)) for q in groupby(df, [:y])])

giving:

2Γ—4 DataFrame
 Row β”‚ y      x      a      b     
     β”‚ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 β”‚     1      4      1      2
   2 β”‚     2      6      1      2

My preferred approach:

using Statistics

df |>
    x -> groupby(x, :y) |>
    x -> subset(x, :x => (x -> x .== maximum(x)))
2Γ—4 DataFrame
 Row β”‚ y      x      a      b     
     β”‚ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 β”‚     1      4      1      2
   2 β”‚     2      6      1      2

Or using TidierData:

using TidierData
using Statistics

@chain df begin
    @group_by(y)
    @filter(x == maximum(x))
    @ungroup
end

# The ungroup is required because only `@summarize()` operations ungroup data frames automatically.
2Γ—4 DataFrame
 Row β”‚ y      x      a      b     
     β”‚ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 β”‚     1      4      1      2
   2 β”‚     2      6      1      2
1 Like

Wow! Thank you. Very elegant

1 Like
let
    df = DataFrame(y=[1, 1, 2, 2],x=[3, 4, 5, 6],a=[1, 1, 1, 1],b=[2, 2, 2, 2])
    gdf = groupby(df, :y)
    unique(transform(gdf, :x => maximum, renamecols = false))
end

This will not work properly if the a and b fields are not constant as in this example (as the rows won’t be replicas).

If there are multiple lines attaining the maximum value in x, it might output more than one line per group. It isn’t clear what the OP wants, but it may pose a problem.

Another method (which isn’t sensitive to above issues):

julia> DataFrame(last(g) for g in groupby(sort(df, [:y, :x]),:y))
2Γ—4 DataFrame
 Row β”‚ y      x      a      b     
     β”‚ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 β”‚     1      4      1      2
   2 β”‚     2      6      1      2

(but this depends on sort stability of groupby within groups, which seems to work, but I’m not sure is guaranteed).

combine(gr->gr[gr.x .== maximum(gr.x),:],groupby(df,:y))

And another version, (inspired by rocco’s post):

combine(last, groupby(sort(df, [:y, :x]),:y))

(doesn’t return multuple rows/group in case of ties)

1 Like

I believe that the basic operation of the combine(groupby(df,[:cols…]), …) function limits the output to the key columns and the columns acted upon by the func() because there is no a unique criterion for choosing which values of the other columns to output.
In the case in question the problem does not exist since :a and :b have a constant value.
So you could do this to get the desired result.

combine(groupby(df,[:y,:a,:b]),:x=>maximum)

If you don’t want ties, then it’s:

@chain df begin
    @group_by(y)
    @filter(x == maximum(x))
    @slice(1) # for first value, or @slice(n()) for last
    @ungroup
end

Or, for short:

@chain df begin
    @group_by(y)
    @slice_max(x, with_ties = false)
    @ungroup
end
1 Like
combine(groupby(df,:y),:x=>maximum,[:a,:b].=>first.=>[:a,:b])


combine(groupby(df,:y),:x=>maximum,Not([:x,:y]).=>first.=>identity )


Doesn’t work quite right. Consider:

df=DataFrame(y=[1, 1, 2, 2],x=[3, 4, 5, 6],a=[1, 2,3,4],b=[2, 3,4,5])

Yes, certainly. This form is only for the specific case proposed as an example by the OP. This is exactly what I was referring to in the previous comment. Any choice of non-key, non-transformed column values is arbitrary.