DataFrame groupby-aggregate strategy

Python:

df.groupby(["Z"])["A","B"].agg({"A":"max"})

I want to groupby “Z” column and take the maximum of “A” in each group. Values of Column “B”
is the row when A is max.

I have used the following way but it is not the result I want.

combine(groupby(df,:Z),:A=>maximum=>:A,:B=>:B)

May I know the Julia way to do this? Thanks

depending on what’s your expected behavior,

I do not withdraw this post.

The way I read his question the answer from jling does not seem right. But maybe I misunderstood.

maybe df3 in my code below is what the original poster wants

using Random 
using DataFrames 
Random.seed!(2)

df=DataFrame(rand(16,3))
rename!(df,[:A,:B,:Z])
df.Z .= round.(df.Z)

df2=combine(groupby(df,:Z),:A=>maximum=>:A)
df3=leftjoin(df2,df,on=[:A,:Z])
disallowmissing!(df3)

df4=combine(groupby(df,:Z),:A=>maximum=>:A,:B=>first=>:B)

isequal(df4,df3)
1 Like

I think I read the question the same way as you - however I was surprised that pandas would include the values of the other rows when using the aggregation function OP gave, and it looks like it doesn’t:

>>> df = pd.DataFrame({"a": [8, 2, 3, 1, 9, 3], "b": [11, 12, 13, 14, 15, 16], "c": ['a', 'a', 'a', 'b', 'b', 'b']})
>>> df
   a   b  c
0  8  11  a
1  2  12  a
2  3  13  a
3  1  14  b
4  9  15  b
5  3  16  b
>>> df.groupby(["c"])["a","b"].agg({"a":"max"})
<stdin>:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
   a    
   a   b
c       
a  8  13
b  9  16

so here it looks like you just get last for column b, irrespective of where maximum(a) is actually located - in which case this could be replicated by using :B => last => :B in DataFrames.jl

Isn’t there another way without using a join?
For example getting the index of the (A) maximum (instead of the maximum itself) and then use that index to retrieve the (A) maximum and also B.

If I understand your problem correctly this is the way to do it. If you want to get the first maximum write:

combine(sdf -> sdf[argmax(sdf.a), [;a, :b]], groupby(df, :c))

and if you want all rows:

combine(sdf -> sdf[findall(==maximum(sdf.a), sdf.a), [;a, :b]], groupby(df, :c))
4 Likes

@bkamins also gave a great solution on Slack which I’ll add here so it doesn’t get swallowed by the Slack memory hole (as I’m sure I’ll be looking for this at some point in the near future):

julia> df = DataFrame(a = [8, 2, 3, 1, 9, 3], b = [11, 12, 13, 14, 15, 16], c = ['a', 'a', 'a', 'b', 'b', 'b'])
6×3 DataFrame
│ Row │ a     │ b     │ c    │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1   │ 8     │ 11    │ 'a'  │
│ 2   │ 2     │ 12    │ 'a'  │
│ 3   │ 3     │ 13    │ 'a'  │
│ 4   │ 1     │ 14    │ 'b'  │
│ 5   │ 9     │ 15    │ 'b'  │
│ 6   │ 3     │ 16    │ 'b'  │

julia> combine(groupby(df, :c), :a => maximum => :a, [:a, :b] => ((a,b) -> b[argmax(a)])  => :b)
2×3 DataFrame
│ Row │ c    │ a     │ b     │
│     │ Char │ Int64 │ Int64 │
├─────┼──────┼───────┼───────┤
│ 1   │ 'a'  │ 8     │ 11    │
│ 2   │ 'b'  │ 9     │ 15    │

so by passing [:a, :b] to the combine call we can create a two argument anonymous function

7 Likes