How DataFramesMeta @by works if you want to groupby two columns?

Hi all, I was reading the DataFramesMeta docu trying to solve a small issue that I have, but I couldn’t find the answer. Any help is welcome!

Basically, I have a df that looks like this:

df = DataFrame(
            a = repeat(1:4, outer = 2),
            b = ["a", "b", "c", "d", "e", "f", "g", "h"],
            c = [23,76,9,90,123,67,13,5])

And what I am looking for is getting a df that only holds the maximum value of c for each repeated value of a, and the value of b that correspond to that row. Something like this:

|a|  b|  c|
|-|---|---|
|1|"e"|123|
|2|"b"| 76|
|3|"g"| 13|
|4|"d"| 90|

I already know that I can use this line to get the first and last column:

@by(df, :a, :c= maximum(:c))

But I don’t really know how to add the corresponding line of b for each value of a. I tried a couple of things, but none worked.

Is there an easy way of doing this?

Thanks a lot!

I don’t use DataFramesMeta but in base DataFrames you can write

julia> combine(groupby(df, :a), 
    :c => maximum => :c, 
    [:b, :c] => ((b, c) -> b[findmax(c)[2]]) => :b)
4Γ—3 DataFrame
 Row β”‚ a      c      b
     β”‚ Int64  Int64  String
─────┼──────────────────────
   1 β”‚     1    123  e
   2 β”‚     2     76  b
   3 β”‚     3     13  g
   4 β”‚     4     90  d
1 Like

You can use the @astable macro-flag for this.

julia> df = DataFrame(
                   a = repeat(1:4, outer = 2),
                   b = ["a", "b", "c", "d", "e", "f", "g", "h"],
                   c = [23,76,9,90,123,67,13,5]);

julia> @by df :a @astable begin 
           maxval, ind = findmax(:c)
           :c = maxval
           :b = :b[ind]
       end
4Γ—3 DataFrame
 Row β”‚ a      c      b      
     β”‚ Int64  Int64  String 
─────┼──────────────────────
   1 β”‚     1    123  e
   2 β”‚     2     76  b
   3 β”‚     3     13  g
   4 β”‚     4     90  d
5 Likes

Also you can just do:

julia> @by(df, :a, :c = maximum(:c), :b = :b[argmax(:c)])
4Γ—3 DataFrame
 Row β”‚ a      c      b
     β”‚ Int64  Int64  String
─────┼──────────────────────
   1 β”‚     1    123  e
   2 β”‚     2     76  b
   3 β”‚     3     13  g
   4 β”‚     4     90  d

(it will be slower than what @pdeffebach proposed, but maybe simpler to read)

3 Likes

Ah of course, I also should have used argmax(c) instead of findmax(c)[2].

2 Likes

Thank you all!! All of the answers work for me.

Best!
Juan

g=groupby(df,:a)
combine(g, [:b,:c]=>((x,y)->[argmax(last,zip(x,y))])=>[:b,:c])

argmax(f, domain)

Return a value x in the domain of f for which f(x) is maximised.

I made a couple of attempts to use dataframesmeta, but the second failed.

julia> @by df :a @astable begin
           x, y = argmax(last,zip(:b,:c))
           :b=x
           :c=y
       end
4Γ—3 DataFrame
 Row β”‚ a      b       c     
     β”‚ Int64  String  Int64
─────┼──────────────────────
   1 β”‚     1  e         123
   2 β”‚     2  b          76
   3 β”‚     3  g          13
   4 β”‚     4  d          90

julia> @by df :a @astable begin
           :b, :c = argmax(last,zip(:b,:c))
       end
0Γ—1 DataFrame

this works (or almost)

@by df :a @astable :bc=argmax(last,zip(:b,:c))

# 4Γ—2 DataFrame
#  Row β”‚ a      bc
#      β”‚ Int64  Tuple…
# ─────┼───────────────────
#    1 β”‚     1  ("e", 123)
#    2 β”‚     2  ("b", 76)
#    3 β”‚     3  ("g", 13)
#    4 β”‚     4  ("d", 90)

Yes,

:b, :c = argmax(last,zip(:b,:c))

is not supported.

I think the docs are clear here:

In a single block, all assignments of the form :y = f(:x) or $y = f(:x) at the top-level generate new columns. In the second form, y must be a string or Symbol.

It says nothing about tuple assignment.

If you think this would be a good feature, please file an issue to track it. There might be an implementation that tracks all assignments and keeps track of them, but the implementation was difficult so I make no guarantees it will work.

I don’t have a very strong belief on the subject.
I don’t know how β€œimportant” this use case is.
However, it seems to me a β€œnatural” extension of the basic syntax.
I have no idea how much complexity implementing such a feature implies.
If the idea is implementable, it should be implemented in such a way as to be able to manage a generic numeric of variables.
I insert the example of 3 variables to make the idea …

julia> df = DataFrame(    a = repeat(1:4, outer = 2),
                          b = ["a", "b", "c", "d", "e", "f", "g", "h"],   
                          c=rand(1:10,8),
                          d = [23,76,9,90,123,67,13,5])
8Γ—4 DataFrame
 Row β”‚ a      b       c      d     
     β”‚ Int64  String  Int64  Int64
─────┼─────────────────────────────
   1 β”‚     1  a           1     23
   2 β”‚     2  b           3     76
   3 β”‚     3  c           6      9
   4 β”‚     4  d          10     90
   5 β”‚     1  e           1    123
   6 β”‚     2  f           7     67
   7 β”‚     3  g           9     13
   8 β”‚     4  h          10      5

julia> 

julia> @by df :a @astable begin
              :d=argmax(last,zip(:b,:c,:d))
          end
4Γ—2 DataFrame
 Row β”‚ a      d
     β”‚ Int64  Tuple…
─────┼──────────────────────
   1 β”‚     1  ("e", 1, 123)
   2 β”‚     2  ("b", 3, 76)
   3 β”‚     3  ("g", 9, 13)
   4 β”‚     4  ("d", 10, 90)

julia> @by df :a @astable begin
              x=argmax(last,zip(:b,:c,:d))
              :b=x[1]
              :c=x[2]
              :d=x[3]
          end
4Γ—4 DataFrame
 Row β”‚ a      b       c      d     
     β”‚ Int64  String  Int64  Int64
─────┼─────────────────────────────
   1 β”‚     1  e           1    123
   2 β”‚     2  b           3     76
   3 β”‚     3  g           9     13
   4 β”‚     4  d          10     90

julia> @by df :a @astable begin
             :b,:c,:d=argmax(last,zip(:b,:c,:d))
          end
0Γ—1 DataFrame
# In any case, this should give a warning about the correct use of the syntax

Hm weird I thought DataFramesMeta.jl and DataFrameMacros.jl worked the same way here, but apparently not?

using DataFrameMacros

df = DataFrame(a = repeat(1:4, outer = 2),
               b = ["a", "b", "c", "d", "e", "f", "g", "h"],   
               c=rand(1:10,8),
               d = [23,76,9,90,123,67,13,5])

julia> @combine groupby(df, :a) @astable :b,:c,:d = argmax(last,zip(:b,:c,:d))
4Γ—4 DataFrame
 Row β”‚ a      b       c      d     
     β”‚ Int64  String  Int64  Int64 
─────┼─────────────────────────────
   1 β”‚     1  e           8    123
   2 β”‚     2  b           5     76
   3 β”‚     3  g           7     13
   4 β”‚     4  d           1     90