# Sort DataFrame by the greater of multiple columns

Is there a way to `sort` a `DataFrame` by the greater of multiple columns e.g.

``````df = DataFrame(A = [21,22,18,19], B = [18,17,25,30])
4×2 DataFrame
Row │ A      B
│ Int64  Int64
─────┼──────────────
1 │    21     18
2 │    22     17
3 │    18     25
4 │    19     30
``````

`df` should be sorted such that the row with the greater of `df.A` and `df.B` should be placed first. The desired outcome would be

``````4×2 DataFrame
Row │ A      B
│ Int64  Int64
─────┼──────────────
1 │    19     30
2 │    18     25
3 │    22     17
4 │    21     18
``````

I haven’t been able to find anything in the `sort` documentation that would facilitate this so any thoughts would be appreciated. Thanks!

1 Like

One method might be:

``````julia> df[sortperm(max.(df.A, df.B); rev=true),:]
4×2 DataFrame
Row │ A      B
│ Int64  Int64
─────┼──────────────
1 │    19     30
2 │    18     25
3 │    22     17
4 │    21     18
``````

and to mutate `df` according to new order:

``````permute!(df, sortperm(max.(df.A, df.B); rev=true))
``````
1 Like

This is the approach for plain DataFrames.jl. With DataFramesMeta.jl do:

``````julia> using DataFramesMeta

julia> @orderby(df, -max.(:A, :B))
4×2 DataFrame
Row │ A      B
│ Int64  Int64
─────┼──────────────
1 │    19     30
2 │    18     25
3 │    22     17
4 │    21     18
``````
2 Likes

Thanks so much these solutions they are super helpful! Sorry it took me a little longer to understand how `sortperm` was working here. Just to clarify `sortperm` is returning a vector of row indices ordered by the row wise max of the values in the two columns?

Also it won’t make a practical difference in this example since they are both so fast, so I understand if you don’t have time to go into it, but I was just wondering why the difference in speed of the two approaches appears relatively large.

``````# sortperm indexing approach
@benchmark df[sortperm(max.(df.A, df.B); rev=true),:]
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max):  1.462 μs …  4.633 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.500 μs              ┊ GC (median):    0.00%
Time  (mean ± σ):   1.518 μs ± 84.809 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

██ ▃▁
▂▆▇█████▅▆▅▃▃▃▂▃▃▃▃▃▂▃▃▂▂▂▂▂▂▁▂▂▁▂▂▂▂▁▂▂▁▂▂▂▁▂▁▁▂▂▂▂▂▂▂▂▂▂ ▃
1.46 μs        Histogram: frequency by time        1.86 μs <

Memory estimate: 1.14 KiB, allocs estimate: 15.

# DataFramesMeta orderby approach
@benchmark @orderby(df, -max.(:A, :B))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  16.791 μs …  90.875 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     17.292 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   17.402 μs ± 970.047 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▅ ▆▇▃ █ ▃▅ ▄▃  ▁
▂▂▁▂▃▁▅▇▅▇█▁███▆█▆██▁██▇▅█▄▆▇▇▁▆▅▃▅▃▄▅▄▁▄▃▂▃▃▁▃▂▁▃▂▂▂▂▁▂▂▁▂▂ ▄
16.8 μs         Histogram: frequency by time         18.5 μs <

Memory estimate: 9.41 KiB, allocs estimate: 164.
``````

and with a larger number of rows in the `DataFrame`

``````df6 = DataFrame( A = rand(Int,1000000), B = rand(Int,1000000))

# DataFramesMeta orderby method
@benchmark @orderby(df6, -max(:A,:B))
BenchmarkTools.Trial: 39 samples with 1 evaluation.
Range (min … max):  122.419 ms … 148.830 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     128.563 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   129.022 ms ±   4.472 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

█
▄▁▁▇▁▆▁▄▁▇▄▆▆▇█▇▁▁▇▁▄▄▄▁▁▁▁▁▁▆▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
122 ms           Histogram: frequency by time          149 ms <

Memory estimate: 90.90 MiB, allocs estimate: 65116.

# sortperm indexing method
@benchmark  df6[sortperm(max.(df6.A, df6.B); rev=true),:]
BenchmarkTools.Trial: 60 samples with 1 evaluation.
Range (min … max):  82.176 ms … 91.108 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     84.036 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   84.381 ms ±  1.486 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▂
▄▃▃▁▃▁▁▁▃▃▃████▅▃▄▄▃▄▃▅▁▁▁▁▁▁▁▁▁▃▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
82.2 ms         Histogram: frequency by time        90.2 ms <

Memory estimate: 30.52 MiB, allocs estimate: 2
``````

The difference in performance is because solution using `sortperm` is a custom approach and `@orderby` uses a more general mechanism (thus it is less optimized).

1 Like

I’m not sure that’s the case. I can replicate the benchmark:

``````
julia> nomac(df6) = df6[sortperm(@.(-max(df6.A, df6.B))),:];

julia> mac(df6) = @orderby(df6, @. -max(:A,:B));

julia> @btime mac(\$df6);
224.913 ms (65104 allocations: 90.91 MiB)

julia> @btime nomac(\$df6);
151.738 ms (24 allocations: 30.52 MiB)
``````

but looking at the implementation, it does use `sortperm`.

``````function orderby_helper(x, args...)
x, exprs, outer_flags, kw = get_df_args_kwargs(x, args...; wrap_byrow = false)
t = (fun_to_vec(ex; gensym_names = true, outer_flags = outer_flags) for ex in exprs)
quote
\$orderby(\$x, \$(t...); \$(kw...))
end
end

function orderby(x::AbstractDataFrame, @nospecialize(args...))
t = DataFrames.select(x, args...; copycols = false)
x[sortperm(t), :]
end
``````

the difference is too big to be due to the overhead of `select`, especially with `copycols = false`. So I don’t know what’s going on.

EDIT: I see, it’s the difference between `sortperm(v::Vector)` and `sortperm(df::DataFrame)`. The latter is what `@orderby` calls, and this requires extra work in case the DataFrame has multiple columns.

There is room for a specialized implementation if the `DataFrame` has one column, which I think DataFramesMeta.jl can do.

1 Like