Sort rows in a dataframe based on a predefined order

New to Julia, and have taken (much) effort in seeking a solution to the ‘simple’ question but failed.

julia> df = DataFrame(col1= ['a', 'b', 'c'], col2 = [1,2,3])
3×2 DataFrame
 Row │ col1  col2  
     │ Char  Int64 
─────┼─────────────
   1 │ a         1
   2 │ b         2
   3 │ c         3

julia> order1 = ["b", "a", "c"]
3-element Vector{String}:
 "b"
 "a"
 "c"

Please help on how to sort the ‘df’ to follow the ‘order1’.
THANK YOU!

One thing to note, in Julia, unlike python and R, 'a' is not the same as "a". The first is a character, the second is a string.

This is actually kind of hard. I can’t think of a good solution using DataFrames.jl’s sort function. So instead here is a solution using DataFramesMeta.jl.

julia> df = DataFrame(col1= ["a", "b", "c"], col2 = [1,2,3]);

julia> order1 = ["b", "a", "c"];

julia> using DataFramesMeta

julia> @rorderby df findfirst(==(:col1), order1)
3×2 DataFrame
 Row │ col1    col2  
     │ String  Int64 
─────┼───────────────
   1 │ b           2
   2 │ a           1
   3 │ c           3

I hope someone things of something better!

3 Likes

Alternative without DataFramesMeta package:

julia> df.order=[findfirst(df.col1.==x) for x in order1]
3-element Vector{Int64}:
 2
 1
 3

julia> sort!(df, :order)
3×3 DataFrame
 Row │ col1  col2   order 
     │ Char  Int64  Int64 
─────┼────────────────────
   1 │ b         2      1
   2 │ a         1      2
   3 │ c         3      3

julia> select!(df, Not(:order))
3×2 DataFrame
 Row │ col1  col2  
     │ Char  Int64 
─────┼─────────────
   1 │ b         2
   2 │ a         1
   3 │ c         3
2 Likes

You could also consider (initial post had the arguments order swapped, corrected now):

df[indexin(order1, df.col1),:]

result:

3×2 DataFrame
 Row │ col1    col2  
     │ String  Int64 
─────┼───────────────
   1 │ b           2
   2 │ a           1
   3 │ c           3
6 Likes

Thanks Rafael,
Your code works on this example! But it doesn’t for the dataframe below:

julia> df2
5×2 DataFrame
 Row │ sample         reads    
     │ String         Int64    
─────┼─────────────────────────
   1 │ RD58-100pg_S7   6096257
   2 │ RD58-10ng_S10  12590390
   3 │ RD58-10pg_S6    2216467
   4 │ RD58-1ng_S9     5381029
   5 │ RD58-500pg_S8   6870582

julia> myOrder = ["RD58-10pg_S6", "RD58-100pg_S7", "RD58-500pg_S8", "RD58-1ng_S9", "RD58-10ng_S10"]
5-element Vector{String}:
 "RD58-10pg_S6"
 "RD58-100pg_S7"
 "RD58-500pg_S8"
 "RD58-1ng_S9"
 "RD58-10ng_S10"

julia> df2[indexin(df2.sample, myOrder),:]
5×2 DataFrame
 Row │ sample         reads    
     │ String         Int64    
─────┼─────────────────────────
   1 │ RD58-10ng_S10  12590390
   2 │ RD58-500pg_S8   6870582
   3 │ RD58-100pg_S7   6096257
   4 │ RD58-1ng_S9     5381029
   5 │ RD58-10pg_S6    2216467

@pdeffebach’s code works well.

julia> @rorderby df2 findfirst(==(:sample), myOrder)
5×2 DataFrame
 Row │ sample         reads    
     │ String         Int64    
─────┼─────────────────────────
   1 │ RD58-10pg_S6    2216467
   2 │ RD58-100pg_S7   6096257
   3 │ RD58-500pg_S8   6870582
   4 │ RD58-1ng_S9     5381029
   5 │ RD58-10ng_S10  12590390

Thanks!

Sorry, the order of the arguments needs to be swapped (corrected order in post above).

For your last example:

using DataFrames
col1 = ["RD58-100pg_S7", "RD58-10ng_S10", "RD58-10pg_S6", "RD58-1ng_S9", "RD58-500pg_S8"]
col2 = [6096257, 12590390, 2216467, 5381029,6870582]
df = DataFrame(col1=col1, col2=col2)
order1 = ["RD58-10pg_S6", "RD58-100pg_S7", "RD58-500pg_S8", "RD58-1ng_S9", "RD58-10ng_S10"]

Herein the result:

df[indexin(order1, df.col1),:]

5×2 DataFrame
 Row │ col1           col2     
     │ String         Int64    
─────┼─────────────────────────
   1 │ RD58-10pg_S6    2216467
   2 │ RD58-100pg_S7   6096257
   3 │ RD58-500pg_S8   6870582
   4 │ RD58-1ng_S9     5381029
   5 │ RD58-10ng_S10  12590390
1 Like