Way to return multiple columns after applying combine-groupby transformation

JBH_84 · February 26, 2024, 9:12pm

I have the following df that i apply a transformation on. it only returns columns y and x however. is there a way to also return the corresponding values in columns a and b?

df=DataFrame(y=[1, 1, 2, 2],x=[3, 4, 5, 6],a=[1, 1, 1, 1],b=[2, 2, 2, 2])
combine(groupby(df,[:y]), [:x] .=> maximum; renamecols=false)

jar1 · February 26, 2024, 9:17pm

groupby(df, [:y, :a, :b])

JBH_84 · February 26, 2024, 9:47pm

sorry. i should have clarified: i dont want to groupby :a and :b. i only want the corresponding values from those columns when i groupby :y and then find the maximum value of :x for each unique value of :y

jar1 · February 26, 2024, 10:26pm

I don’t love this but it works

julia> combine(groupby(df, :y), 
    [:x, :a, :b] => ((x, a, b) -> let i = argmax(x); (;x=x[i], a=a[i], b=b[i]) end) => AsTable)
2×4 DataFrame
 Row │ y      x      a      b     
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │     1      4      1      2
   2 │     2      6      1      2

Dan · February 26, 2024, 11:31pm

Another option:

DataFrame([argmax(t->(t.x),eachrow(q)) for q in groupby(df, [:y])])

giving:

2×4 DataFrame
 Row │ y      x      a      b     
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │     1      4      1      2
   2 │     2      6      1      2

kdpsingh · February 27, 2024, 12:29am

My preferred approach:

using Statistics

df |>
    x -> groupby(x, :y) |>
    x -> subset(x, :x => (x -> x .== maximum(x)))

2×4 DataFrame
 Row │ y      x      a      b     
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │     1      4      1      2
   2 │     2      6      1      2

Or using TidierData:

using TidierData
using Statistics

@chain df begin
    @group_by(y)
    @filter(x == maximum(x))
    @ungroup
end

# The ungroup is required because only `@summarize()` operations ungroup data frames automatically.

2×4 DataFrame
 Row │ y      x      a      b     
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │     1      4      1      2
   2 │     2      6      1      2

JBH_84 · February 27, 2024, 1:43am

Wow! Thank you. Very elegant

George9000 · February 27, 2024, 1:50am

let
    df = DataFrame(y=[1, 1, 2, 2],x=[3, 4, 5, 6],a=[1, 1, 1, 1],b=[2, 2, 2, 2])
    gdf = groupby(df, :y)
    unique(transform(gdf, :x => maximum, renamecols = false))
end

Dan · February 27, 2024, 6:37pm

This will not work properly if the a and b fields are not constant as in this example (as the rows won’t be replicas).

If there are multiple lines attaining the maximum value in x, it might output more than one line per group. It isn’t clear what the OP wants, but it may pose a problem.

Another method (which isn’t sensitive to above issues):

julia> DataFrame(last(g) for g in groupby(sort(df, [:y, :x]),:y))
2×4 DataFrame
 Row │ y      x      a      b     
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │     1      4      1      2
   2 │     2      6      1      2

(but this depends on sort stability of groupby within groups, which seems to work, but I’m not sure is guaranteed).

rocco_sprmnt21 · February 27, 2024, 6:55pm

combine(gr->gr[gr.x .== maximum(gr.x),:],groupby(df,:y))

Dan · February 27, 2024, 7:01pm

And another version, (inspired by rocco’s post):

combine(last, groupby(sort(df, [:y, :x]),:y))

(doesn’t return multuple rows/group in case of ties)

rocco_sprmnt21 · February 27, 2024, 7:02pm

I believe that the basic operation of the combine(groupby(df,[:cols…]), …) function limits the output to the key columns and the columns acted upon by the func() because there is no a unique criterion for choosing which values of the other columns to output.
In the case in question the problem does not exist since :a and :b have a constant value.
So you could do this to get the desired result.

combine(groupby(df,[:y,:a,:b]),:x=>maximum)

kdpsingh · February 27, 2024, 7:13pm

If you don’t want ties, then it’s:

@chain df begin
    @group_by(y)
    @filter(x == maximum(x))
    @slice(1) # for first value, or @slice(n()) for last
    @ungroup
end

Or, for short:

@chain df begin
    @group_by(y)
    @slice_max(x, with_ties = false)
    @ungroup
end

rocco_sprmnt21 · February 27, 2024, 8:28pm

combine(groupby(df,:y),:x=>maximum,[:a,:b].=>first.=>[:a,:b])


combine(groupby(df,:y),:x=>maximum,Not([:x,:y]).=>first.=>identity )

Dan · February 27, 2024, 9:04pm

Doesn’t work quite right. Consider:

df=DataFrame(y=[1, 1, 2, 2],x=[3, 4, 5, 6],a=[1, 2,3,4],b=[2, 3,4,5])

rocco_sprmnt21 · February 27, 2024, 9:27pm

Yes, certainly. This form is only for the specific case proposed as an example by the OP. This is exactly what I was referring to in the previous comment. Any choice of non-key, non-transformed column values is arbitrary.

Topic		Replies	Views
DataFrame groupby-aggregate strategy General Usage question	7	5910	June 27, 2020
Get value of column corresponding to condition being met for another column in a Grouped DataFrame New to Julia data , dataframes	1	816	October 11, 2021
How DataFramesMeta @by works if you want to groupby two columns? New to Julia dataframes , dataframesmeta	10	774	October 31, 2022
Combine rows but keep all columns Data dataframes	9	144	November 14, 2024
Combining a col from each DF group into a single DF New to Julia question , dataframes	5	296	August 25, 2022

Way to return multiple columns after applying combine-groupby transformation

Related topics