Sort DataFrame by the greater of multiple columns

Is there a way to sort a DataFrame by the greater of multiple columns e.g.

df = DataFrame(A = [21,22,18,19], B = [18,17,25,30])
4×2 DataFrame
 Row │ A      B     
     │ Int64  Int64 
─────┼──────────────
   1 │    21     18
   2 │    22     17
   3 │    18     25
   4 │    19     30

df should be sorted such that the row with the greater of df.A and df.B should be placed first. The desired outcome would be

4×2 DataFrame
 Row │ A      B     
     │ Int64  Int64 
─────┼──────────────
   1 │    19     30
   2 │    18     25
   3 │    22     17
   4 │    21     18

I haven’t been able to find anything in the sort documentation that would facilitate this so any thoughts would be appreciated. Thanks!

1 Like

One method might be:

julia> df[sortperm(max.(df.A, df.B); rev=true),:]
4×2 DataFrame
 Row │ A      B     
     │ Int64  Int64 
─────┼──────────────
   1 │    19     30
   2 │    18     25
   3 │    22     17
   4 │    21     18

and to mutate df according to new order:

permute!(df, sortperm(max.(df.A, df.B); rev=true))
1 Like

This is the approach for plain DataFrames.jl. With DataFramesMeta.jl do:

julia> using DataFramesMeta

julia> @orderby(df, -max.(:A, :B))
4×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │    19     30
   2 │    18     25
   3 │    22     17
   4 │    21     18
2 Likes

Thanks so much these solutions they are super helpful! Sorry it took me a little longer to understand how sortperm was working here. Just to clarify sortperm is returning a vector of row indices ordered by the row wise max of the values in the two columns?

Also it won’t make a practical difference in this example since they are both so fast, so I understand if you don’t have time to go into it, but I was just wondering why the difference in speed of the two approaches appears relatively large.

# sortperm indexing approach
@benchmark df[sortperm(max.(df.A, df.B); rev=true),:] 
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.462 μs …  4.633 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.518 μs ± 84.809 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ██ ▃▁                                                    
  ▂▆▇█████▅▆▅▃▃▃▂▃▃▃▃▃▂▃▃▂▂▂▂▂▂▁▂▂▁▂▂▂▂▁▂▂▁▂▂▂▁▂▁▁▂▂▂▂▂▂▂▂▂▂ ▃
  1.46 μs        Histogram: frequency by time        1.86 μs <

 Memory estimate: 1.14 KiB, allocs estimate: 15.

# DataFramesMeta orderby approach
@benchmark @orderby(df, -max.(:A, :B))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.791 μs …  90.875 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.292 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.402 μs ± 970.047 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▅ ▆▇▃ █ ▃▅ ▄▃  ▁                                    
  ▂▂▁▂▃▁▅▇▅▇█▁███▆█▆██▁██▇▅█▄▆▇▇▁▆▅▃▅▃▄▅▄▁▄▃▂▃▃▁▃▂▁▃▂▂▂▂▁▂▂▁▂▂ ▄
  16.8 μs         Histogram: frequency by time         18.5 μs <

 Memory estimate: 9.41 KiB, allocs estimate: 164.

and with a larger number of rows in the DataFrame

df6 = DataFrame( A = rand(Int,1000000), B = rand(Int,1000000))

# DataFramesMeta orderby method
@benchmark @orderby(df6, -max(:A,:B))
BenchmarkTools.Trial: 39 samples with 1 evaluation.
 Range (min … max):  122.419 ms … 148.830 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     128.563 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   129.022 ms ±   4.472 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                █                                                
  ▄▁▁▇▁▆▁▄▁▇▄▆▆▇█▇▁▁▇▁▄▄▄▁▁▁▁▁▁▆▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  122 ms           Histogram: frequency by time          149 ms <

 Memory estimate: 90.90 MiB, allocs estimate: 65116.
 
# sortperm indexing method 
@benchmark  df6[sortperm(max.(df6.A, df6.B); rev=true),:]
BenchmarkTools.Trial: 60 samples with 1 evaluation.
 Range (min … max):  82.176 ms … 91.108 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     84.036 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   84.381 ms ±  1.486 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

               █▂                                              
  ▄▃▃▁▃▁▁▁▃▃▃████▅▃▄▄▃▄▃▅▁▁▁▁▁▁▁▁▁▃▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  82.2 ms         Histogram: frequency by time        90.2 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 2

The difference in performance is because solution using sortperm is a custom approach and @orderby uses a more general mechanism (thus it is less optimized).

1 Like

I’m not sure that’s the case. I can replicate the benchmark:


julia> nomac(df6) = df6[sortperm(@.(-max(df6.A, df6.B))),:];

julia> mac(df6) = @orderby(df6, @. -max(:A,:B));

julia> @btime mac($df6);
  224.913 ms (65104 allocations: 90.91 MiB)

julia> @btime nomac($df6);
  151.738 ms (24 allocations: 30.52 MiB)

but looking at the implementation, it does use sortperm.

function orderby_helper(x, args...)
    x, exprs, outer_flags, kw = get_df_args_kwargs(x, args...; wrap_byrow = false)
    t = (fun_to_vec(ex; gensym_names = true, outer_flags = outer_flags) for ex in exprs)
    quote
        $orderby($x, $(t...); $(kw...))
    end
end

function orderby(x::AbstractDataFrame, @nospecialize(args...))
    t = DataFrames.select(x, args...; copycols = false)
    x[sortperm(t), :]
end

the difference is too big to be due to the overhead of select, especially with copycols = false. So I don’t know what’s going on.

EDIT: I see, it’s the difference between sortperm(v::Vector) and sortperm(df::DataFrame). The latter is what @orderby calls, and this requires extra work in case the DataFrame has multiple columns.

There is room for a specialized implementation if the DataFrame has one column, which I think DataFramesMeta.jl can do.

1 Like