How DataFramesMeta @by works if you want to groupby two columns?

Juan_Mac_Donagh · October 26, 2022, 2:37pm

Hi all, I was reading the DataFramesMeta docu trying to solve a small issue that I have, but I couldn’t find the answer. Any help is welcome!

Basically, I have a df that looks like this:

df = DataFrame(
            a = repeat(1:4, outer = 2),
            b = ["a", "b", "c", "d", "e", "f", "g", "h"],
            c = [23,76,9,90,123,67,13,5])

And what I am looking for is getting a df that only holds the maximum value of c for each repeated value of a, and the value of b that correspond to that row. Something like this:

|a|  b|  c|
|-|---|---|
|1|"e"|123|
|2|"b"| 76|
|3|"g"| 13|
|4|"d"| 90|

I already know that I can use this line to get the first and last column:

@by(df, :a, :c= maximum(:c))

But I don’t really know how to add the corresponding line of b for each value of a. I tried a couple of things, but none worked.

Is there an easy way of doing this?

Thanks a lot!

nilshg · October 26, 2022, 3:05pm

I don’t use DataFramesMeta but in base DataFrames you can write

julia> combine(groupby(df, :a), 
    :c => maximum => :c, 
    [:b, :c] => ((b, c) -> b[findmax(c)[2]]) => :b)
4×3 DataFrame
 Row │ a      c      b
     │ Int64  Int64  String
─────┼──────────────────────
   1 │     1    123  e
   2 │     2     76  b
   3 │     3     13  g
   4 │     4     90  d

pdeffebach · October 26, 2022, 3:09pm

You can use the @astable macro-flag for this.

julia> df = DataFrame(
                   a = repeat(1:4, outer = 2),
                   b = ["a", "b", "c", "d", "e", "f", "g", "h"],
                   c = [23,76,9,90,123,67,13,5]);

julia> @by df :a @astable begin 
           maxval, ind = findmax(:c)
           :c = maxval
           :b = :b[ind]
       end
4×3 DataFrame
 Row │ a      c      b      
     │ Int64  Int64  String 
─────┼──────────────────────
   1 │     1    123  e
   2 │     2     76  b
   3 │     3     13  g
   4 │     4     90  d

bkamins · October 26, 2022, 4:50pm

Also you can just do:

julia> @by(df, :a, :c = maximum(:c), :b = :b[argmax(:c)])
4×3 DataFrame
 Row │ a      c      b
     │ Int64  Int64  String
─────┼──────────────────────
   1 │     1    123  e
   2 │     2     76  b
   3 │     3     13  g
   4 │     4     90  d

(it will be slower than what @pdeffebach proposed, but maybe simpler to read)

nilshg · October 26, 2022, 5:28pm

Ah of course, I also should have used argmax(c) instead of findmax(c)[2].

Juan_Mac_Donagh · October 26, 2022, 6:51pm

Thank you all!! All of the answers work for me.

Best!
Juan

rocco_sprmnt21 · October 30, 2022, 6:06pm

g=groupby(df,:a)
combine(g, [:b,:c]=>((x,y)->[argmax(last,zip(x,y))])=>[:b,:c])

argmax(f, domain)

Return a value x in the domain of f for which f(x) is maximised.

rocco_sprmnt21 · October 30, 2022, 6:26pm

I made a couple of attempts to use dataframesmeta, but the second failed.

julia> @by df :a @astable begin
           x, y = argmax(last,zip(:b,:c))
           :b=x
           :c=y
       end
4×3 DataFrame
 Row │ a      b       c     
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  e         123
   2 │     2  b          76
   3 │     3  g          13
   4 │     4  d          90

julia> @by df :a @astable begin
           :b, :c = argmax(last,zip(:b,:c))
       end
0×1 DataFrame

this works (or almost)

@by df :a @astable :bc=argmax(last,zip(:b,:c))

# 4×2 DataFrame
#  Row │ a      bc
#      │ Int64  Tuple…
# ─────┼───────────────────
#    1 │     1  ("e", 123)
#    2 │     2  ("b", 76)
#    3 │     3  ("g", 13)
#    4 │     4  ("d", 90)

pdeffebach · October 30, 2022, 10:02pm

Yes,

:b, :c = argmax(last,zip(:b,:c))

is not supported.

I think the docs are clear here:

In a single block, all assignments of the form :y = f(:x) or $y = f(:x) at the top-level generate new columns. In the second form, y must be a string or Symbol.

It says nothing about tuple assignment.

If you think this would be a good feature, please file an issue to track it. There might be an implementation that tracks all assignments and keeps track of them, but the implementation was difficult so I make no guarantees it will work.

rocco_sprmnt21 · October 31, 2022, 7:55am

I don’t have a very strong belief on the subject.
I don’t know how “important” this use case is.
However, it seems to me a “natural” extension of the basic syntax.
I have no idea how much complexity implementing such a feature implies.
If the idea is implementable, it should be implemented in such a way as to be able to manage a generic numeric of variables.
I insert the example of 3 variables to make the idea …

julia> df = DataFrame(    a = repeat(1:4, outer = 2),
                          b = ["a", "b", "c", "d", "e", "f", "g", "h"],   
                          c=rand(1:10,8),
                          d = [23,76,9,90,123,67,13,5])
8×4 DataFrame
 Row │ a      b       c      d     
     │ Int64  String  Int64  Int64
─────┼─────────────────────────────
   1 │     1  a           1     23
   2 │     2  b           3     76
   3 │     3  c           6      9
   4 │     4  d          10     90
   5 │     1  e           1    123
   6 │     2  f           7     67
   7 │     3  g           9     13
   8 │     4  h          10      5

julia> 

julia> @by df :a @astable begin
              :d=argmax(last,zip(:b,:c,:d))
          end
4×2 DataFrame
 Row │ a      d
     │ Int64  Tuple…
─────┼──────────────────────
   1 │     1  ("e", 1, 123)
   2 │     2  ("b", 3, 76)
   3 │     3  ("g", 9, 13)
   4 │     4  ("d", 10, 90)

julia> @by df :a @astable begin
              x=argmax(last,zip(:b,:c,:d))
              :b=x[1]
              :c=x[2]
              :d=x[3]
          end
4×4 DataFrame
 Row │ a      b       c      d     
     │ Int64  String  Int64  Int64
─────┼─────────────────────────────
   1 │     1  e           1    123
   2 │     2  b           3     76
   3 │     3  g           9     13
   4 │     4  d          10     90

julia> @by df :a @astable begin
             :b,:c,:d=argmax(last,zip(:b,:c,:d))
          end
0×1 DataFrame
# In any case, this should give a warning about the correct use of the syntax

jules · October 31, 2022, 10:33am

Hm weird I thought DataFramesMeta.jl and DataFrameMacros.jl worked the same way here, but apparently not?

using DataFrameMacros

df = DataFrame(a = repeat(1:4, outer = 2),
               b = ["a", "b", "c", "d", "e", "f", "g", "h"],   
               c=rand(1:10,8),
               d = [23,76,9,90,123,67,13,5])

julia> @combine groupby(df, :a) @astable :b,:c,:d = argmax(last,zip(:b,:c,:d))
4×4 DataFrame
 Row │ a      b       c      d     
     │ Int64  String  Int64  Int64 
─────┼─────────────────────────────
   1 │     1  e           8    123
   2 │     2  b           5     76
   3 │     3  g           7     13
   4 │     4  d           1     90

Topic		Replies	Views
DataFrame groupby-aggregate strategy General Usage question	7	5910	June 27, 2020
Way to return multiple columns after applying combine-groupby transformation New to Julia question , dataframes	15	787	February 27, 2024
Get value of column corresponding to condition being met for another column in a Grouped DataFrame New to Julia data , dataframes	1	816	October 11, 2021
Simplifying this dataframe aggregation operation General Usage	19	2265	September 9, 2018
Index of max values of N columns in a DataFrame without manually entering each column New to Julia question , indexing , dataframes , maxima	15	1511	January 7, 2023

How DataFramesMeta @by works if you want to groupby two columns?

Related topics