Index of max values of N columns in a DataFrame without manually entering each column

Hi pretty sure I’m overlooking something simple here but would appreciate some pointers.

Suppose I have the following DataFrame

df = DataFrame( idA = rand(-100:100,5), idB = rand(-100:100,5), C = rand(-100:100,5), idD = rand(-100:100,5))
5×4 DataFrame
 Row │ idA    idB    C      idD   
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │   -76     80     11     13
   2 │    35      1     93    -24
   3 │   -50    -79     80     -1
   4 │   -41     36    -50     89
   5 │    27   -100     34     37

I can get an ranked index of the maximum values of the ‘id’ columns as per @Dan’s solution here Sort DataFrame by the greater of multiple columns with

sortperm(max.(df.idA,df.idB,df.idD),rev=true)
5-element Vector{Int64}:
 4
 1
 5
 2
 3

I can then use this index to rank the DataFrame by max values of the selected columns like (again from @Dan 's answer)

permute!(df, sortperm(max.(df.A, df.B); rev=true))

For a large number of id columns however it would be easier to collect the id columns without explicitly listing each column. What would be the best approach to get the array of (df.idA,df.idB,df.idD) without having to manually list them?

As a work around I tried

select(df, r"id")

However this creates a new DataFrame so broadcasting max via the . operator returns

max.(select(df,r"id"))
5×3 DataFrame
 Row │ idA    idB    idD   
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │   -76     80     13
   2 │    35      1    -24
   3 │   -50    -79     -1
   4 │   -41     36     89
   5 │    27   -100     37

as opposed to the desired result of

max.(df.idA,df.idB,df.idD)
5-element Vector{Int64}:
 80
 35
 -1
 89
 37

converting the DataFrame to Arrays via Matrix

max.(Matrix(select(df,r"id")))
5×3 Matrix{Int64}:
 -76    80   13
  35     1  -24
 -50   -79   -1
 -41    36   89
  27  -100   37

also did not work so I switched to

maximum.(eachrow(select(df,r"id")))
5-element Vector{Int64}:
 80
 35
 -1
 89
 37

However I noticed that while very fast, it is orders of magnitude slower than the initial max approach. (I’m assuming this has to do with type stability between columns?)

# maximum approach
@benchmark maximum.(eachrow(select(df,r"id")))
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.312 μs …  13.681 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.444 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.486 μs ± 196.884 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▄▇██ █▆▄ ▁                                              
  ▁▂▄▇▅████████▆█▇▆▅▃▄▄▃▂▃▃▃▃▂▂▂▂▁▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  5.31 μs         Histogram: frequency by time        6.04 μs <

 Memory estimate: 3.27 KiB, allocs estimate: 59.
# max approach
@benchmark max.(df.idA,df.idB,df.idD)
BenchmarkTools.Trial: 10000 samples with 196 evaluations.
 Range (min … max):  477.888 ns … 665.393 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     489.796 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   491.454 ns ±   8.238 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▄▃▃█▄▄▅▁                                           
  ▁▁▁▂▂▂▃▃▄▇▇███████████▅▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  478 ns           Histogram: frequency by time          521 ns <

 Memory estimate: 160 bytes, allocs estimate: 3.

converting the DataFrame to Arrays does not seem to shrink the difference much

@benchmark maximum.(eachrow(Matrix(select(df,r"id"))))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.714 μs …  10.755 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.833 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.867 μs ± 140.935 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▃█▁█▆▅ ▁        ▁                                       
  ▁▂▃▄███████▆█▇▆▃▆▇█▅██▇▄▆▆▄▂▄▃▃▂▃▃▃▂▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  3.71 μs         Histogram: frequency by time        4.25 μs <

 Memory estimate: 3.09 KiB, allocs estimate: 43.

So I was just wondering 1) is there a way to collect the id columns and then broadcast max? 2) what is happening with max.(df) so that the initial DataFrame is being returned? I looked through the documentation but am still unsure about it so any insights would be greatly appreciated. Thanks!

1 Like

The following is one way:

julia> maximum(Matrix(select(df,r"id")); dims=2)
5×1 Matrix{Int64}:
 71
 24
 34
 76
 90

and now sortperm of this vector will give the desired order (with rev).

Another way is using:

julia> select(df, r"id" => ByRow(max) => :maxid)

which gives the above vector and:

julia> df[sortperm(select(df, r"id" => ByRow(max) => :maxid); rev=true),:]
5×4 DataFrame
 Row │ idA    idB    C      idD   
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │    90    -64     89     -8
   2 │   -80     76    -37    -58
   3 │    71    -50     99    -99
   4 │    34    -12      6    -76
   5 │    24    -88    -69    -91

does the reordering.

This also works:

julia> select(df, AsTable(r"id") => (t->maximum(t)) => :maxid)
5×1 DataFrame
 Row │ maxid 
     │ Int64 
─────┼───────
   1 │    71
   2 │    24
   3 │    34
   4 │   -80
   5 │    90

UPDATE: The following contained an error originally, and used maximum instead of max. as parameter to orderby. The fixed statement is:

Finally, using DataFramesMeta, it is rather compact:

julia> using DataFramesMeta

julia> @orderby df -max.(AsTable(r"id")...)
5×4 DataFrame
 Row │ idA    idB    C      idD   
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │    90    -64     89     -8
   2 │    71    -50     99    -99
   3 │    34    -12      6    -76
   4 │    24    -88    -69    -91
   5 │   -80     76    -37    -58
2 Likes

Thanks so much for saving the day again! I really appreciate it.

I’m just curious because all these DataFrame methods add a few orders of magnitude compared to your original approach.

df = DataFrame( idA = rand(-100:100,5), idB = rand(-100:100,5), C = rand(-100:100,5), idD = rand(-100:100,5))
5×4 DataFrame
 Row │ idA    idB    C      idD   
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │    76     95     77     56
   2 │    33     77     20     30
   3 │   -92     78     11     92
   4 │     3    -68     60    -43
   5 │   -42     39     43    -88

So even applying maximum to a matrix

@benchmark maximum(Matrix(select(df,r"id")); dims=2)

BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.031 μs … 329.958 μs  ┊ GC (min … max): 0.00% … 98.43%
 Time  (median):     3.109 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.273 μs ±   6.510 μs  ┊ GC (mean ± σ):  3.95% ±  1.97%

      ▂ ▇█▅▃                                                   
  ▁▁▃▅█▇████▅▆▅▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  3.03 μs         Histogram: frequency by time        3.59 μs <

 Memory estimate: 3.03 KiB, allocs estimate: 49.

Takes orders of magnitude Thant your original method of just listing the df id columns and broadcasting max (though this delta shrink with an increase in rows.)

@benchmark max.(df.idA,df.idB,df.idD)
BenchmarkTools.Trial: 10000 samples with 197 evaluations.
 Range (min … max):  451.569 ns …  10.453 μs  ┊ GC (min … max): 0.00% … 95.39%
 Time  (median):     457.487 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   465.382 ns ± 236.886 ns  ┊ GC (mean ± σ):  1.24% ±  2.33%

        ▁ ▅▃█▃ ▁                                                 
  ▁▁▂▃▅▅█▇██████▄▅▃▃▂▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  452 ns           Histogram: frequency by time          483 ns <

 Memory estimate: 160 bytes, allocs estimate: 3.

Is there a way to gather all the id columns so that your original approach of sortperm(max.(df.idA,df.idB,df.idD), rev=true) can be used? Would some kind of comprehension with names(df, r"id") to reference the Dataframe columns be possible or advisable?

Yes:

julia> @benchmark max.(eachcol($df[!, r"id"])...)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.189 μs … 310.356 μs  ┊ GC (min … max): 0.00% … 98.75%
 Time  (median):     2.444 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.743 μs ±   5.897 μs  ┊ GC (mean ± σ):  4.26% ±  1.97%

   ▁▆ █▇▂▂
  ▂███████▆▃▃▃▃▃▃▃▂▃▃▃▃▃▂▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  2.19 μs         Histogram: frequency by time        4.61 μs <

 Memory estimate: 2.16 KiB, allocs estimate: 31.

Note that most of the cost is in doing column lookup and selection via regex as:

julia> df2 = df[!, r"id"];

julia> @benchmark max.(eachcol($df2)...)
BenchmarkTools.Trial: 10000 samples with 169 evaluations.
 Range (min … max):  621.302 ns …  14.826 μs  ┊ GC (min … max): 0.00% … 94.98%
 Time  (median):     675.148 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   714.059 ns ± 338.369 ns  ┊ GC (mean ± σ):  1.35% ±  2.80%

     █▅
  ▃▆▇██▆▆▄▃▃▅▆██▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▃
  621 ns           Histogram: frequency by time         1.16 μs <

 Memory estimate: 272 bytes, allocs estimate: 7.

In general my question is why for such small tables you try to optimize performance?

1 Like

Sorry I’m sure I’m doing something wrong here but I get the following output

max.(eachcol($df[!, r"id"])...)
ERROR: syntax: "$" expression outside quote around REPL[26]:1
Stacktrace:
 [1] top-level scope
   @ REPL[26]:1

Also why do we need the splat operator here?

Yes thank you very much for pointing that out. The actual tables are exponentially larger but I think there is also a heavy component of curiosity and lack of experience that causes this unnecessary tinkering. I’ll be more cognizant of it in the future!

For this problem, the {{ }} syntax of DataFrameMacros.jl can be convenient. With that, you get a tuple of the referenced columns spliced into your expression, which you can call maximum on, and - it for reverse order:

julia> using DataFrameMacros

julia> df = DataFrame( idA = rand(-100:100,5), idB = rand(-100:100,5), C = rand(-100:100,5), idD = rand(-100:100,5))
5×4 DataFrame
 Row │ idA    idB    C      idD
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │    33    -45    -55     36
   2 │    58     21     -1     -5
   3 │    34    -98     93     26
   4 │   -83     25    -84     28
   5 │   -84     97     21    -98

julia> @sort(df, -maximum({{r"id"}}))
5×4 DataFrame
 Row │ idA    idB    C      idD
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │   -84     97     21    -98
   2 │    58     21     -1     -5
   3 │    33    -45    -55     36
   4 │    34    -98     93     26
   5 │   -83     25    -84     28

Edit: Ah I overlooked the similar DataFramesMeta solution posted above, seems like they added a special cased AsTable recently (second edit, well not that recently, a year ago)

@orderby df -maximum(AsTable(r"id"))
2 Likes

Keep in mind though that performance on very small tables might not be indicative of performance on large tables, especially when using DataFrames machinery like select/combine/transform etc. which might have some fixed overhead which can be amortized over large tables.

2 Likes

$ is only needed when benchmarking, see GitHub - JuliaCI/BenchmarkTools.jl: A benchmarking framework for the Julia language.

Because max takes multiple positional arguments and not one iterable argument (like maximum).

You misunderstood me. This was a valid question. If your tables are much larger then the cost of extra selection will be negligible, and your issue will disappear therefore using DataFrame and regex selector is OK (as they will have very small cost relative to the cost of actual computations). However, if you had a lot of (like millions ) small tables then the situation is different and another data structure should be used. But I understand your problem is the former.

2 Likes

Exactly - this was the reason of my question about the kind of data @phantom actually works with.

1 Like

Without prejudice to all the observations on the advisability of this aggressive treatment, here is another way of obtaining the same result in a perhaps faster way (it should also be verified for different table sizes)

max.([df[!,c] for c in names(df) if c!=("C")]...)

or

max.([df[!,c] for c in filter(n->startswith(n,"id"), names(df) )]...)
3 Likes

Thank you all again for being so generous with your insight and knowledge. It is incredibly helpful and much appreciated. I marked @Dan 's original answer due to the simplicity in the code and as pointed out by bkamins and nilshg the delta in competition time doesn’t necessarily carry over into larger tables. However, I might be wrong, but is it the case that the DataFramesMeta solution

should be modified to

 
@orderby(df7, -max.(AsTable(r"id")...))

I am likely missing something here, but in the following example each of these solutions generate the same DataFrame

df7 = DataFrame(Aid = rand(-100:100, 10000000), Bid = rand(-100:100, 10000000), C = rand(-100:100,10000000), Did = rand(-100:100,10000000))
mxbyrow = df7[sortperm(select(df7, r"id" => ByRow(max) => :maxid); rev=true),:]

mxclmn = df7[sortperm(max.(df7.Aid, df7.Bid, df7.Did),rev = true), :]

mxeachclmn = df7[sortperm(max.(eachcol(df7[!, r"id"])...),rev = true), :]

mxfltr = df7[ sortperm(max.([df7[!,c] for c in filter(n-> endswith(n,"id"), names(df7))]...), rev=true),:]

using DataFrameMacros

mxmacro =  @sort(df7, -maximum({{r"id"}}))

mxfltr == mxclmn == mxeachclmn == mxmacro == mxbyrow
true

However with the DataFramesMeta approach

using DataFrames
mxmeta =  @orderby df7 -maximum(AsTable(r"id"))

mxclmn == mxmeta
false

but if I use the syntax from @bkamins previous post here Sort DataFrame by the greater of multiple columns

mxmeta2  =  @orderby(df7, -max.(AsTable(r"id")...))

then

mxclmn == mxmeta2
true

so

@orderby(df7, -max.(AsTable(r"id")...))  == @orderby df7 -maximum(AsTable(r"id"))
false

Is this correct? I am probably missing something but I can’t quite figure out how @orderby df7 -maximum(AsTable(r"id")) is ranking the rows.

Just as an aside in terms of speed, if the rows are ranked in ascending rather than descending order, it seems that the non macro approaches are still a bit faster.

 @benchmark @orderby($df7, max.(AsTable(r"id")...))
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  967.029 ms …   1.118 s  ┊ GC (min … max):  0.00% … 13.52%
 Time  (median):        1.105 s              ┊ GC (median):    12.17%
 Time  (mean ± σ):      1.078 s ± 62.659 ms  ┊ GC (mean ± σ):  10.06% ±  5.54%

  █                                                █    ██   █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁██▁▁▁█ ▁
  967 ms          Histogram: frequency by time          1.12 s <

 Memory estimate: 2.24 GiB, allocs estimate: 2957653.

julia> @benchmark $df7[sortperm(max.(eachcol($df7[!, r"id"])...)), :]
BenchmarkTools.Trial: 25 samples with 1 evaluation.
 Range (min … max):  193.540 ms … 233.784 ms  ┊ GC (min … max): 0.00% … 16.63%
 Time  (median):     196.598 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   200.730 ms ±  10.667 ms  ┊ GC (mean ± σ):  1.98% ±  4.84%

  ▁▄▁█▁    ▁▁                                                    
  █████▆▆▁▆██▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▆ ▁
  194 ms           Histogram: frequency by time          234 ms <

 Memory estimate: 457.77 MiB, allocs estimate: 59.

max. (with dot) should be used. maximum takes a collection and returns its maximal element (whole vectors are processed). See the difference:

julia> x = [3, 7, 1]
3-element Vector{Int64}:
 3
 7
 1

julia> y = [4, 6, 3]
3-element Vector{Int64}:
 4
 6
 3

julia> z = [5, 3, 2]
3-element Vector{Int64}:
 5
 3
 2

julia> max.(x, y, z)
3-element Vector{Int64}:
 5
 7
 3

julia> maximum([x,y,z])
3-element Vector{Int64}:
 5
 3
 2
1 Like

Got it thanks!

Sorry just writing it out for future reference


@orderby df -maximum(AsTable(r"id"))

  1. Selects the id column in the DataFrame with the largest element in the first row. (Aid in the example below)

  2. Orders the rows of the DataFrame by ranking the elements of that one column in ascending maximum or descending -maximum order.

df = DataFrame(Aid = rand(1:10, 4), Bid = rand(1:10, 4), C = rand(1:10,4))
4×3 DataFrame
 Row │ Aid    Bid    C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     6      2      9
   2 │     5      6      2
   3 │     9      5      3
   4 │     2      7      8

@orderby df -maximum(AsTable(r"id"))
# selects Aid because df.Aid[1] > df.Bid[1]
# ranks df according to Aid 

4×3 DataFrame
 Row │ Aid    Bid    C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     9      5      3
   2 │     6      2      9
   3 │     5      6      2
   4 │     2      7      8

whereas

@orderby(df, -max.(AsTable(r"id")...))

Orders the rows of the DataFrame by ranking the max elements in each row of all id columns.

julia> @orderby(dft, -max.(AsTable(r"id")...))
4×3 DataFrame
 Row │ Aid    Bid    C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     9      5      3
   2 │     2      7      8
   3 │     6      2      9
   4 │     5      6      2

and in case of tie second row etc. as long as there is a tie (this is called lexicographic order).

1 Like

I’ve fixed my post above according to this (use max. instead of maximum).
The source of the confusion, is that maximum can be used for row-wise maximum by adding the dims=2 named parameter on a matrix argument (at the top of my post).

2 Likes

or maximum for row-wise maximum using the zip function

maximum.(zip([df[!,c] for c in names(df) if c!="C"]...));
df[sortperm([maximum(r) for r in zip([df[!,c] for c in filter(!=("C"), names(df) )]...)], lt= >= ),:]

PS

it seems that the last form has problems.

1 Like