Hi pretty sure I’m overlooking something simple here but would appreciate some pointers.
Suppose I have the following DataFrame
df = DataFrame( idA = rand(-100:100,5), idB = rand(-100:100,5), C = rand(-100:100,5), idD = rand(-100:100,5))
5×4 DataFrame
Row │ idA idB C idD
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ -76 80 11 13
2 │ 35 1 93 -24
3 │ -50 -79 80 -1
4 │ -41 36 -50 89
5 │ 27 -100 34 37
I can get an ranked index of the maximum values of the ‘id’ columns as per @Dan’s solution here Sort DataFrame by the greater of multiple columns with
sortperm(max.(df.idA,df.idB,df.idD),rev=true)
5-element Vector{Int64}:
4
1
5
2
3
I can then use this index to rank the DataFrame
by max values of the selected columns like (again from @Dan 's answer)
permute!(df, sortperm(max.(df.A, df.B); rev=true))
For a large number of id
columns however it would be easier to collect the id
columns without explicitly listing each column. What would be the best approach to get the array of (df.idA,df.idB,df.idD)
without having to manually list them?
As a work around I tried
select(df, r"id")
However this creates a new DataFrame
so broadcasting max
via the .
operator returns
max.(select(df,r"id"))
5×3 DataFrame
Row │ idA idB idD
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ -76 80 13
2 │ 35 1 -24
3 │ -50 -79 -1
4 │ -41 36 89
5 │ 27 -100 37
as opposed to the desired result of
max.(df.idA,df.idB,df.idD)
5-element Vector{Int64}:
80
35
-1
89
37
converting the DataFrame to Arrays via Matrix
max.(Matrix(select(df,r"id")))
5×3 Matrix{Int64}:
-76 80 13
35 1 -24
-50 -79 -1
-41 36 89
27 -100 37
also did not work so I switched to
maximum.(eachrow(select(df,r"id")))
5-element Vector{Int64}:
80
35
-1
89
37
However I noticed that while very fast, it is orders of magnitude slower than the initial max
approach. (I’m assuming this has to do with type stability between columns?)
# maximum approach
@benchmark maximum.(eachrow(select(df,r"id")))
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max): 5.312 μs … 13.681 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.444 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.486 μs ± 196.884 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▇██ █▆▄ ▁
▁▂▄▇▅████████▆█▇▆▅▃▄▄▃▂▃▃▃▃▂▂▂▂▁▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
5.31 μs Histogram: frequency by time 6.04 μs <
Memory estimate: 3.27 KiB, allocs estimate: 59.
# max approach
@benchmark max.(df.idA,df.idB,df.idD)
BenchmarkTools.Trial: 10000 samples with 196 evaluations.
Range (min … max): 477.888 ns … 665.393 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 489.796 ns ┊ GC (median): 0.00%
Time (mean ± σ): 491.454 ns ± 8.238 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▃▃█▄▄▅▁
▁▁▁▂▂▂▃▃▄▇▇███████████▅▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
478 ns Histogram: frequency by time 521 ns <
Memory estimate: 160 bytes, allocs estimate: 3.
converting the DataFrame
to Arrays does not seem to shrink the difference much
@benchmark maximum.(eachrow(Matrix(select(df,r"id"))))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
Range (min … max): 3.714 μs … 10.755 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.833 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.867 μs ± 140.935 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▁█▆▅ ▁ ▁
▁▂▃▄███████▆█▇▆▃▆▇█▅██▇▄▆▆▄▂▄▃▃▂▃▃▃▂▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
3.71 μs Histogram: frequency by time 4.25 μs <
Memory estimate: 3.09 KiB, allocs estimate: 43.
So I was just wondering 1) is there a way to collect the id
columns and then broadcast max
? 2) what is happening with max.(df)
so that the initial DataFrame
is being returned? I looked through the documentation but am still unsure about it so any insights would be greatly appreciated. Thanks!