Any way to speed up sorting a dataframe? A more efficient sortperm would be great

Consideirng a DataFrame

using DataFrames
using Random: randstring
M = 100_000_000
str_base = [randstring(8) for i in 1:1_000_000]
df = DataFrame(int = rand(Int32, M), float=rand(M), str = rand(str_base, M))

@time sort!(df, :int); 
# 80s on my machine

@time sort!(df, :str); 
# 170son my machine

using CSV
CSV.write("tmp.csv", df)

The same operation using R’s data.table is like 3s

library(data.table)
df = fread("tmp.csv")
setkey(df, "int") 
# 3s 

setkey(df, "str") 
# 25s 

So based on this the performance of data.table is still much better.

Now the sort! algorithm is really simple which I can replicate here

]add https://github.com/xiaodaigh/SortingLab.jl
using SortingLab
using Base.Threads: @spawn
function another_sort!(df, col)
    @time ordering = fsortperm(df[!, col])
    channel_lock = Channel{Bool}(length(names(df)))
    for c in names(df)
        @spawn begin
            v = df[!, c]
            @inbounds v = v[ordering]
            put!(channel_lock, true)
        end
    end
    for _ in names(df)
        take!(channel_lock)
    end
    df
end

@time another_sort!(df, :int); # sortperm is 10s total 12s~18s
@time another_sort!(df, :str); # sortperm is 10s total 12s~18s

You can see that (f)sortperm takes 10s. So using a more optimise sortperm like SortingLab.fsortperm can get much better results already.

The solution seems to be about finding a more efficient sortperm. For a start, perhaps adapting SortingLab.fsortperm would be a good start.

Since this is suggesting a very specific improvement to a package, opening a pull request or at least an issue there might be the best way to start a discussion about this.

Already discussed on slack.