# Index of max values of N columns in a DataFrame without manually entering each column

Hi pretty sure I’m overlooking something simple here but would appreciate some pointers.

Suppose I have the following `DataFrame`

``````df = DataFrame( idA = rand(-100:100,5), idB = rand(-100:100,5), C = rand(-100:100,5), idD = rand(-100:100,5))
5×4 DataFrame
Row │ idA    idB    C      idD
│ Int64  Int64  Int64  Int64
─────┼────────────────────────────
1 │   -76     80     11     13
2 │    35      1     93    -24
3 │   -50    -79     80     -1
4 │   -41     36    -50     89
5 │    27   -100     34     37
``````

I can get an ranked index of the maximum values of the ‘id’ columns as per @Dan’s solution here Sort DataFrame by the greater of multiple columns with

``````sortperm(max.(df.idA,df.idB,df.idD),rev=true)
5-element Vector{Int64}:
4
1
5
2
3
``````

I can then use this index to rank the `DataFrame` by max values of the selected columns like (again from @Dan 's answer)

``````permute!(df, sortperm(max.(df.A, df.B); rev=true))
``````

For a large number of `id` columns however it would be easier to collect the `id` columns without explicitly listing each column. What would be the best approach to get the array of `(df.idA,df.idB,df.idD)` without having to manually list them?

As a work around I tried

``````select(df, r"id")
``````

However this creates a new `DataFrame` so broadcasting `max` via the `.` operator returns

``````max.(select(df,r"id"))
5×3 DataFrame
Row │ idA    idB    idD
│ Int64  Int64  Int64
─────┼─────────────────────
1 │   -76     80     13
2 │    35      1    -24
3 │   -50    -79     -1
4 │   -41     36     89
5 │    27   -100     37
``````

as opposed to the desired result of

``````max.(df.idA,df.idB,df.idD)
5-element Vector{Int64}:
80
35
-1
89
37
``````

converting the DataFrame to Arrays via `Matrix`

``````max.(Matrix(select(df,r"id")))
5×3 Matrix{Int64}:
-76    80   13
35     1  -24
-50   -79   -1
-41    36   89
27  -100   37
``````

also did not work so I switched to

``````maximum.(eachrow(select(df,r"id")))
5-element Vector{Int64}:
80
35
-1
89
37
``````

However I noticed that while very fast, it is orders of magnitude slower than the initial `max` approach. (I’m assuming this has to do with type stability between columns?)

``````# maximum approach
@benchmark maximum.(eachrow(select(df,r"id")))
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max):  5.312 μs …  13.681 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     5.444 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   5.486 μs ± 196.884 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▄▇██ █▆▄ ▁
▁▂▄▇▅████████▆█▇▆▅▃▄▄▃▂▃▃▃▃▂▂▂▂▁▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
5.31 μs         Histogram: frequency by time        6.04 μs <

Memory estimate: 3.27 KiB, allocs estimate: 59.
``````
``````# max approach
@benchmark max.(df.idA,df.idB,df.idD)
BenchmarkTools.Trial: 10000 samples with 196 evaluations.
Range (min … max):  477.888 ns … 665.393 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     489.796 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   491.454 ns ±   8.238 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▄▃▃█▄▄▅▁
▁▁▁▂▂▂▃▃▄▇▇███████████▅▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
478 ns           Histogram: frequency by time          521 ns <

Memory estimate: 160 bytes, allocs estimate: 3.
``````

converting the `DataFrame` to Arrays does not seem to shrink the difference much

``````@benchmark maximum.(eachrow(Matrix(select(df,r"id"))))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
Range (min … max):  3.714 μs …  10.755 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     3.833 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   3.867 μs ± 140.935 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▃█▁█▆▅ ▁        ▁
▁▂▃▄███████▆█▇▆▃▆▇█▅██▇▄▆▆▄▂▄▃▃▂▃▃▃▂▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
3.71 μs         Histogram: frequency by time        4.25 μs <

Memory estimate: 3.09 KiB, allocs estimate: 43.
``````

So I was just wondering 1) is there a way to collect the `id` columns and then broadcast `max`? 2) what is happening with `max.(df)` so that the initial `DataFrame` is being returned? I looked through the documentation but am still unsure about it so any insights would be greatly appreciated. Thanks!

1 Like

The following is one way:

``````julia> maximum(Matrix(select(df,r"id")); dims=2)
5×1 Matrix{Int64}:
71
24
34
76
90
``````

and now `sortperm` of this vector will give the desired order (with `rev`).

Another way is using:

``````julia> select(df, r"id" => ByRow(max) => :maxid)
``````

which gives the above vector and:

``````julia> df[sortperm(select(df, r"id" => ByRow(max) => :maxid); rev=true),:]
5×4 DataFrame
Row │ idA    idB    C      idD
│ Int64  Int64  Int64  Int64
─────┼────────────────────────────
1 │    90    -64     89     -8
2 │   -80     76    -37    -58
3 │    71    -50     99    -99
4 │    34    -12      6    -76
5 │    24    -88    -69    -91
``````

does the reordering.

This also works:

``````julia> select(df, AsTable(r"id") => (t->maximum(t)) => :maxid)
5×1 DataFrame
Row │ maxid
│ Int64
─────┼───────
1 │    71
2 │    24
3 │    34
4 │   -80
5 │    90
``````

UPDATE: The following contained an error originally, and used `maximum` instead of `max.` as parameter to `orderby`. The fixed statement is:

Finally, using DataFramesMeta, it is rather compact:

``````julia> using DataFramesMeta

julia> @orderby df -max.(AsTable(r"id")...)
5×4 DataFrame
Row │ idA    idB    C      idD
│ Int64  Int64  Int64  Int64
─────┼────────────────────────────
1 │    90    -64     89     -8
2 │    71    -50     99    -99
3 │    34    -12      6    -76
4 │    24    -88    -69    -91
5 │   -80     76    -37    -58
``````
2 Likes

Thanks so much for saving the day again! I really appreciate it.

I’m just curious because all these DataFrame methods add a few orders of magnitude compared to your original approach.

``````df = DataFrame( idA = rand(-100:100,5), idB = rand(-100:100,5), C = rand(-100:100,5), idD = rand(-100:100,5))
5×4 DataFrame
Row │ idA    idB    C      idD
│ Int64  Int64  Int64  Int64
─────┼────────────────────────────
1 │    76     95     77     56
2 │    33     77     20     30
3 │   -92     78     11     92
4 │     3    -68     60    -43
5 │   -42     39     43    -88
``````

So even applying `maximum` to a matrix

``````@benchmark maximum(Matrix(select(df,r"id")); dims=2)

BenchmarkTools.Trial: 10000 samples with 8 evaluations.
Range (min … max):  3.031 μs … 329.958 μs  ┊ GC (min … max): 0.00% … 98.43%
Time  (median):     3.109 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   3.273 μs ±   6.510 μs  ┊ GC (mean ± σ):  3.95% ±  1.97%

▂ ▇█▅▃
▁▁▃▅█▇████▅▆▅▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
3.03 μs         Histogram: frequency by time        3.59 μs <

Memory estimate: 3.03 KiB, allocs estimate: 49.
``````

Takes orders of magnitude Thant your original method of just listing the `df` `id` columns and broadcasting `max` (though this delta shrink with an increase in rows.)

``````@benchmark max.(df.idA,df.idB,df.idD)
BenchmarkTools.Trial: 10000 samples with 197 evaluations.
Range (min … max):  451.569 ns …  10.453 μs  ┊ GC (min … max): 0.00% … 95.39%
Time  (median):     457.487 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   465.382 ns ± 236.886 ns  ┊ GC (mean ± σ):  1.24% ±  2.33%

▁ ▅▃█▃ ▁
▁▁▂▃▅▅█▇██████▄▅▃▃▂▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
452 ns           Histogram: frequency by time          483 ns <

Memory estimate: 160 bytes, allocs estimate: 3.
``````

Is there a way to gather all the `id` columns so that your original approach of `sortperm(max.(df.idA,df.idB,df.idD), rev=true)` can be used? Would some kind of comprehension with `names(df, r"id")` to reference the `Dataframe` columns be possible or advisable?

Yes:

``````julia> @benchmark max.(eachcol(\$df[!, r"id"])...)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max):  2.189 μs … 310.356 μs  ┊ GC (min … max): 0.00% … 98.75%
Time  (median):     2.444 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   2.743 μs ±   5.897 μs  ┊ GC (mean ± σ):  4.26% ±  1.97%

▁▆ █▇▂▂
▂███████▆▃▃▃▃▃▃▃▂▃▃▃▃▃▂▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
2.19 μs         Histogram: frequency by time        4.61 μs <

Memory estimate: 2.16 KiB, allocs estimate: 31.
``````

Note that most of the cost is in doing column lookup and selection via regex as:

``````julia> df2 = df[!, r"id"];

julia> @benchmark max.(eachcol(\$df2)...)
BenchmarkTools.Trial: 10000 samples with 169 evaluations.
Range (min … max):  621.302 ns …  14.826 μs  ┊ GC (min … max): 0.00% … 94.98%
Time  (median):     675.148 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   714.059 ns ± 338.369 ns  ┊ GC (mean ± σ):  1.35% ±  2.80%

█▅
▃▆▇██▆▆▄▃▃▅▆██▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▃
621 ns           Histogram: frequency by time         1.16 μs <

Memory estimate: 272 bytes, allocs estimate: 7.
``````

In general my question is why for such small tables you try to optimize performance?

1 Like

Sorry I’m sure I’m doing something wrong here but I get the following output

``````max.(eachcol(\$df[!, r"id"])...)
ERROR: syntax: "\$" expression outside quote around REPL:1
Stacktrace:
 top-level scope
@ REPL:1
``````

Also why do we need the splat operator here?

Yes thank you very much for pointing that out. The actual tables are exponentially larger but I think there is also a heavy component of curiosity and lack of experience that causes this unnecessary tinkering. I’ll be more cognizant of it in the future!

For this problem, the `{{ }}` syntax of DataFrameMacros.jl can be convenient. With that, you get a tuple of the referenced columns spliced into your expression, which you can call `maximum` on, and `-` it for reverse order:

``````julia> using DataFrameMacros

julia> df = DataFrame( idA = rand(-100:100,5), idB = rand(-100:100,5), C = rand(-100:100,5), idD = rand(-100:100,5))
5×4 DataFrame
Row │ idA    idB    C      idD
│ Int64  Int64  Int64  Int64
─────┼────────────────────────────
1 │    33    -45    -55     36
2 │    58     21     -1     -5
3 │    34    -98     93     26
4 │   -83     25    -84     28
5 │   -84     97     21    -98

julia> @sort(df, -maximum({{r"id"}}))
5×4 DataFrame
Row │ idA    idB    C      idD
│ Int64  Int64  Int64  Int64
─────┼────────────────────────────
1 │   -84     97     21    -98
2 │    58     21     -1     -5
3 │    33    -45    -55     36
4 │    34    -98     93     26
5 │   -83     25    -84     28
``````

Edit: Ah I overlooked the similar DataFramesMeta solution posted above, seems like they added a special cased `AsTable` recently (second edit, well not that recently, a year ago)

``````@orderby df -maximum(AsTable(r"id"))
``````
2 Likes

Keep in mind though that performance on very small tables might not be indicative of performance on large tables, especially when using DataFrames machinery like `select`/`combine`/`transform` etc. which might have some fixed overhead which can be amortized over large tables.

2 Likes

`\$` is only needed when benchmarking, see GitHub - JuliaCI/BenchmarkTools.jl: A benchmarking framework for the Julia language.

Because `max` takes multiple positional arguments and not one iterable argument (like `maximum`).

You misunderstood me. This was a valid question. If your tables are much larger then the cost of extra selection will be negligible, and your issue will disappear therefore using `DataFrame` and regex selector is OK (as they will have very small cost relative to the cost of actual computations). However, if you had a lot of (like millions ) small tables then the situation is different and another data structure should be used. But I understand your problem is the former.

2 Likes

Exactly - this was the reason of my question about the kind of data @phantom actually works with.

1 Like

Without prejudice to all the observations on the advisability of this aggressive treatment, here is another way of obtaining the same result in a perhaps faster way (it should also be verified for different table sizes)

``````max.([df[!,c] for c in names(df) if c!=("C")]...)
``````

or

``````max.([df[!,c] for c in filter(n->startswith(n,"id"), names(df) )]...)
``````
3 Likes

Thank you all again for being so generous with your insight and knowledge. It is incredibly helpful and much appreciated. I marked @Dan 's original answer due to the simplicity in the code and as pointed out by bkamins and nilshg the delta in competition time doesn’t necessarily carry over into larger tables. However, I might be wrong, but is it the case that the DataFramesMeta solution

should be modified to

``````
@orderby(df7, -max.(AsTable(r"id")...))

``````

I am likely missing something here, but in the following example each of these solutions generate the same `DataFrame`

``````df7 = DataFrame(Aid = rand(-100:100, 10000000), Bid = rand(-100:100, 10000000), C = rand(-100:100,10000000), Did = rand(-100:100,10000000))
``````
``````mxbyrow = df7[sortperm(select(df7, r"id" => ByRow(max) => :maxid); rev=true),:]

mxclmn = df7[sortperm(max.(df7.Aid, df7.Bid, df7.Did),rev = true), :]

mxeachclmn = df7[sortperm(max.(eachcol(df7[!, r"id"])...),rev = true), :]

mxfltr = df7[ sortperm(max.([df7[!,c] for c in filter(n-> endswith(n,"id"), names(df7))]...), rev=true),:]

using DataFrameMacros

mxmacro =  @sort(df7, -maximum({{r"id"}}))

mxfltr == mxclmn == mxeachclmn == mxmacro == mxbyrow
true

``````

However with the `DataFramesMeta` approach

``````using DataFrames
mxmeta =  @orderby df7 -maximum(AsTable(r"id"))

mxclmn == mxmeta
false
``````

but if I use the syntax from @bkamins previous post here Sort DataFrame by the greater of multiple columns

``````mxmeta2  =  @orderby(df7, -max.(AsTable(r"id")...))

``````

then

``````mxclmn == mxmeta2
true
``````

so

``````@orderby(df7, -max.(AsTable(r"id")...))  == @orderby df7 -maximum(AsTable(r"id"))
false
``````

Is this correct? I am probably missing something but I can’t quite figure out how `@orderby df7 -maximum(AsTable(r"id"))` is ranking the rows.

Just as an aside in terms of speed, if the rows are ranked in ascending rather than descending order, it seems that the non macro approaches are still a bit faster.

`````` @benchmark @orderby(\$df7, max.(AsTable(r"id")...))
BenchmarkTools.Trial: 5 samples with 1 evaluation.
Range (min … max):  967.029 ms …   1.118 s  ┊ GC (min … max):  0.00% … 13.52%
Time  (median):        1.105 s              ┊ GC (median):    12.17%
Time  (mean ± σ):      1.078 s ± 62.659 ms  ┊ GC (mean ± σ):  10.06% ±  5.54%

█                                                █    ██   █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁██▁▁▁█ ▁
967 ms          Histogram: frequency by time          1.12 s <

Memory estimate: 2.24 GiB, allocs estimate: 2957653.

julia> @benchmark \$df7[sortperm(max.(eachcol(\$df7[!, r"id"])...)), :]
BenchmarkTools.Trial: 25 samples with 1 evaluation.
Range (min … max):  193.540 ms … 233.784 ms  ┊ GC (min … max): 0.00% … 16.63%
Time  (median):     196.598 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   200.730 ms ±  10.667 ms  ┊ GC (mean ± σ):  1.98% ±  4.84%

▁▄▁█▁    ▁▁
█████▆▆▁▆██▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▆ ▁
194 ms           Histogram: frequency by time          234 ms <

Memory estimate: 457.77 MiB, allocs estimate: 59.
``````

`max.` (with dot) should be used. `maximum` takes a collection and returns its maximal element (whole vectors are processed). See the difference:

``````julia> x = [3, 7, 1]
3-element Vector{Int64}:
3
7
1

julia> y = [4, 6, 3]
3-element Vector{Int64}:
4
6
3

julia> z = [5, 3, 2]
3-element Vector{Int64}:
5
3
2

julia> max.(x, y, z)
3-element Vector{Int64}:
5
7
3

julia> maximum([x,y,z])
3-element Vector{Int64}:
5
3
2
``````
1 Like

Got it thanks!

Sorry just writing it out for future reference

``````
@orderby df -maximum(AsTable(r"id"))

``````
1. Selects the `id` column in the `DataFrame` with the largest element in the first row. (`Aid` in the example below)

2. Orders the rows of the `DataFrame` by ranking the elements of that one column in ascending `maximum` or descending `-maximum` order.

``````df = DataFrame(Aid = rand(1:10, 4), Bid = rand(1:10, 4), C = rand(1:10,4))
4×3 DataFrame
Row │ Aid    Bid    C
│ Int64  Int64  Int64
─────┼─────────────────────
1 │     6      2      9
2 │     5      6      2
3 │     9      5      3
4 │     2      7      8

@orderby df -maximum(AsTable(r"id"))
# selects Aid because df.Aid > df.Bid
# ranks df according to Aid

4×3 DataFrame
Row │ Aid    Bid    C
│ Int64  Int64  Int64
─────┼─────────────────────
1 │     9      5      3
2 │     6      2      9
3 │     5      6      2
4 │     2      7      8

``````

whereas

``````@orderby(df, -max.(AsTable(r"id")...))
``````

Orders the rows of the `DataFrame` by ranking the max elements in each row of all `id` columns.

``````julia> @orderby(dft, -max.(AsTable(r"id")...))
4×3 DataFrame
Row │ Aid    Bid    C
│ Int64  Int64  Int64
─────┼─────────────────────
1 │     9      5      3
2 │     2      7      8
3 │     6      2      9
4 │     5      6      2

``````

and in case of tie second row etc. as long as there is a tie (this is called lexicographic order).

1 Like

I’ve fixed my post above according to this (use `max.` instead of `maximum`).
The source of the confusion, is that `maximum` can be used for row-wise maximum by adding the `dims=2` named parameter on a matrix argument (at the top of my post).

2 Likes

or maximum for row-wise maximum using the zip function

``````maximum.(zip([df[!,c] for c in names(df) if c!="C"]...));
``````
``````df[sortperm([maximum(r) for r in zip([df[!,c] for c in filter(!=("C"), names(df) )]...)], lt= >= ),:]
``````

PS

it seems that the last form has problems.

1 Like