DataFrame groupby-aggregate strategy

POKIN_CHAN · June 27, 2020, 2:12pm

Python:

df.groupby(["Z"])["A","B"].agg({"A":"max"})

I want to groupby “Z” column and take the maximum of “A” in each group. Values of Column “B”
is the row when A is max.

I have used the following way but it is not the result I want.

combine(groupby(df,:Z),:A=>maximum=>:A,:B=>:B)

May I know the Julia way to do this? Thanks

jling · June 27, 2020, 2:44pm

depending on what’s your expected behavior,

POKIN_CHAN · June 27, 2020, 2:55pm

I do not withdraw this post.

bernhard · June 27, 2020, 6:44pm

The way I read his question the answer from jling does not seem right. But maybe I misunderstood.

maybe df3 in my code below is what the original poster wants

using Random 
using DataFrames 
Random.seed!(2)

df=DataFrame(rand(16,3))
rename!(df,[:A,:B,:Z])
df.Z .= round.(df.Z)

df2=combine(groupby(df,:Z),:A=>maximum=>:A)
df3=leftjoin(df2,df,on=[:A,:Z])
disallowmissing!(df3)

df4=combine(groupby(df,:Z),:A=>maximum=>:A,:B=>first=>:B)

isequal(df4,df3)

nilshg · June 27, 2020, 7:29pm

I think I read the question the same way as you - however I was surprised that pandas would include the values of the other rows when using the aggregation function OP gave, and it looks like it doesn’t:

>>> df = pd.DataFrame({"a": [8, 2, 3, 1, 9, 3], "b": [11, 12, 13, 14, 15, 16], "c": ['a', 'a', 'a', 'b', 'b', 'b']})
>>> df
   a   b  c
0  8  11  a
1  2  12  a
2  3  13  a
3  1  14  b
4  9  15  b
5  3  16  b
>>> df.groupby(["c"])["a","b"].agg({"a":"max"})
<stdin>:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
   a    
   a   b
c       
a  8  13
b  9  16

so here it looks like you just get last for column b, irrespective of where maximum(a) is actually located - in which case this could be replicated by using :B => last => :B in DataFrames.jl

Juan · June 27, 2020, 7:45pm

Isn’t there another way without using a join?
For example getting the index of the (A) maximum (instead of the maximum itself) and then use that index to retrieve the (A) maximum and also B.

bkamins · June 27, 2020, 8:00pm

If I understand your problem correctly this is the way to do it. If you want to get the first maximum write:

combine(sdf -> sdf[argmax(sdf.a), [;a, :b]], groupby(df, :c))

and if you want all rows:

combine(sdf -> sdf[findall(==maximum(sdf.a), sdf.a), [;a, :b]], groupby(df, :c))

nilshg · June 27, 2020, 8:16pm

@bkamins also gave a great solution on Slack which I’ll add here so it doesn’t get swallowed by the Slack memory hole (as I’m sure I’ll be looking for this at some point in the near future):

julia> df = DataFrame(a = [8, 2, 3, 1, 9, 3], b = [11, 12, 13, 14, 15, 16], c = ['a', 'a', 'a', 'b', 'b', 'b'])
6×3 DataFrame
│ Row │ a     │ b     │ c    │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1   │ 8     │ 11    │ 'a'  │
│ 2   │ 2     │ 12    │ 'a'  │
│ 3   │ 3     │ 13    │ 'a'  │
│ 4   │ 1     │ 14    │ 'b'  │
│ 5   │ 9     │ 15    │ 'b'  │
│ 6   │ 3     │ 16    │ 'b'  │

julia> combine(groupby(df, :c), :a => maximum => :a, [:a, :b] => ((a,b) -> b[argmax(a)])  => :b)
2×3 DataFrame
│ Row │ c    │ a     │ b     │
│     │ Char │ Int64 │ Int64 │
├─────┼──────┼───────┼───────┤
│ 1   │ 'a'  │ 8     │ 11    │
│ 2   │ 'b'  │ 9     │ 15    │

so by passing [:a, :b] to the combine call we can create a two argument anonymous function

Topic		Replies	Views
Way to return multiple columns after applying combine-groupby transformation New to Julia question , dataframes	15	783	February 27, 2024
Get value of column corresponding to condition being met for another column in a Grouped DataFrame New to Julia data , dataframes	1	816	October 11, 2021
How DataFramesMeta @by works if you want to groupby two columns? New to Julia dataframes , dataframesmeta	10	773	October 31, 2022
Simplifying this dataframe aggregation operation General Usage	19	2254	September 9, 2018
Split-Apply-Combine on many columns at once? Looking for equivalent to Stata's collapse New to Julia question , dataframes	7	2031	November 9, 2021

DataFrame groupby-aggregate strategy

Related topics