Custom DataFrame column sort order

question
data

#1

Is it possible to sort a DataFrame column based on custom sort order, e.g.

df = DataFrame(col1= ['a','b','c'], col2 = [1,2,3])
sort!(df, col1=custom_order('a','c','b'))

#2

You can always use

df[p, :]

where p is a permutation that encodes your custom order, but if it can be specified by some user-supplied less-than function, you are better off using the sort method for DataFrame.


#3

Elaborating on this SO question, I am trying to use a custom less-than function, to be used with the lt sort keyword, but I don’t know how to provide to the function a third parameter that is the wanted β€œcustom order” (for multiple column sort):

For example, this works:

using DataFrames
df = DataFrame(
  c1 = ['a','b','c','a','b','c'],
  c2 = ["aa","aa","bb","bb","cc","cc"],
  c3 = [1,2,3,10,20,30]
)
6Γ—3 DataFrames.DataFrame
β”‚ Row β”‚ c1  β”‚ c2   β”‚ c3 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€
β”‚ 1   β”‚ 'a' β”‚ "aa" β”‚ 1  β”‚
β”‚ 2   β”‚ 'b' β”‚ "aa" β”‚ 2  β”‚
β”‚ 3   β”‚ 'c' β”‚ "bb" β”‚ 3  β”‚
β”‚ 4   β”‚ 'a' β”‚ "bb" β”‚ 10 β”‚
β”‚ 5   β”‚ 'b' β”‚ "cc" β”‚ 20 β”‚
β”‚ 6   β”‚ 'c' β”‚ "cc" β”‚ 30 β”‚

ordc2 = ["bb","aa","cc"]
function customLt(r1,r2)
    return ( find(x -> x == r1, ordc2)[1] < find(x -> x == r2, ordc2)[1] )
end
sortedDf =  sort(df, cols = [order(:c2, lt=customLt)])
6Γ—3 DataFrames.DataFrame
β”‚ Row β”‚ c1  β”‚ c2   β”‚ c3 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€
β”‚ 1   β”‚ 'c' β”‚ "bb" β”‚ 3  β”‚
β”‚ 2   β”‚ 'a' β”‚ "bb" β”‚ 10 β”‚
β”‚ 3   β”‚ 'a' β”‚ "aa" β”‚ 1  β”‚
β”‚ 4   β”‚ 'b' β”‚ "aa" β”‚ 2  β”‚
β”‚ 5   β”‚ 'b' β”‚ "cc" β”‚ 20 β”‚
β”‚ 6   β”‚ 'c' β”‚ "cc" β”‚ 30 β”‚

But this doesn’t:

using DataFrames
df = DataFrame(
  c1 = ['a','b','c','a','b','c'],
  c2 = ["aa","aa","bb","bb","cc","cc"],
  c3 = [1,2,3,10,20,30]
)
ordc1 = ['b','a','c']
ordc2 = ["bb","aa","cc"]
function customLt(r1,r2,col)
    return ( find(x -> x == r1, col)[1] < find(x -> x == r2, col)[1] )
end
sortedDf =  sort(df, cols = [order(:c2, lt=customLt(ordc2)),order(:c1, lt=customLt(ordc1))] )

MethodError: no method matching customLt(::Array{String,1})
Closest candidates are:
  customLt(::Any, !Matched::Any) at /home/lobianco/git/ffsm_pp/00_private/2016_foretcc/data/output/test.jl:59
  customLt(::Any, !Matched::Any, !Matched::Any) at /home/lobianco/git/ffsm_pp/00_private/2016_foretcc/data/output/test.jl:73
[...]

#4

This should be faster for lookup:

using DataFrames
df = DataFrame(
  c1 = ['a','b','c','a','b','c'],
  c2 = ["aa","aa","bb","bb","cc","cc"],
  c3 = [1,2,3,10,20,30]
)
ordc2 = ["bb","aa","cc"]
orderdict = Dict(x => i for (i,x) in enumerate(ordc2))
sort(df; cols = [:c2], by = x->orderdict[x])

#5

Thank you… I managed to implement it for multiple columns and to place it with a function where users can select custom order (eventually partially) for each of the columns:

function customSort!(df::DataFrame, sortops)
    sortv = []
    sortOptions = []
    if(isa(sortops, Array))
        sortv = sortops
    else
        push!(sortv,sortops)
    end
    for i in sortv
        if(isa(i, Tuple))
            if (isa(i[2], Array)) # The second option is a custom order
                orderArray = Array(collect(union(    OrderedSet(i[2]),  OrderedSet(unique(df[i[1]]))        )))
                push!(sortOptions, order(i[1], by = x->Dict(x => i for (i,x) in enumerate(orderArray))[x] ))
            else                  # The second option is a reverse direction flag
                push!(sortOptions, order(i[1], rev = i[2]))
            end
        else
          push!(sortOptions, order(i))
        end
    end
    return sort!(df, cols = sortOptions)
end

df = DataFrame(
  c1 = ['a','b','c','a','b','c'],
  c2 = ["aa","aa","bb","bb","cc","cc"],
  c3 = [1,2,3,10,20,30],
)
customSort!(df, [(:c2,["bb","cc"]),(:c1,['b','a','c'])])
6Γ—3 DataFrames.DataFrame
β”‚ Row β”‚ c1  β”‚ c2   β”‚ c3 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€
β”‚ 1   β”‚ 'a' β”‚ "bb" β”‚ 10 β”‚
β”‚ 2   β”‚ 'c' β”‚ "bb" β”‚ 3  β”‚
β”‚ 3   β”‚ 'b' β”‚ "cc" β”‚ 20 β”‚
β”‚ 4   β”‚ 'c' β”‚ "cc" β”‚ 30 β”‚
β”‚ 5   β”‚ 'b' β”‚ "aa" β”‚ 2  β”‚
β”‚ 6   β”‚ 'a' β”‚ "aa" β”‚ 1  β”‚

if someone needs it, I uploaded to github together with other utility functions on https://github.com/sylvaticus/LAJuliaUtils.jl

To use it:

Pkg.clone("https://github.com/sylvaticus/LAJuliaUtils.jl.git")
using LAJuliaUtils
?customSort!

#6

Another way to tackle this problem is to use a column type where you can control the sorting. The advantage comes when sorting is done behind the scenes (with grouping or plotting for example); the order will come out like you want.

Here is an example with PooledArrays.jl:

julia> d = PooledArray(["a", "b", "a"], ["b", "a"])
3-element PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}:
 "a"
 "b"
 "a"

julia> sort(d)
3-element PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}:
 "b"
 "a"
 "a"

Note that this trick doesn’t work with DataArrays. I didn’t check if it works with CategoricalArrays.