Any way to speed up sorting a dataframe? A more efficient sortperm would be great

xiaodai · June 24, 2020, 4:01pm

Consider a DataFrame:

using DataFrames
using Random: randstring
M = 100_000_000
str_base = [randstring(8) for i in 1:1_000_000]
df = DataFrame(int = rand(Int32, M), float=rand(M), str = rand(str_base, M))

@time sort!(df, :int); 
# 80s on my machine

@time sort!(df, :str); 
# 170s on my machine

using CSV
CSV.write("tmp.csv", df)

The same operation using R’s data.table takes about 3s for the integer column and 25s for the string column:

library(data.table)
df = fread("tmp.csv")
setkey(df, "int") 
# 3s 

setkey(df, "str") 
# 25s

Based on these timings, data.table is still much faster.

Now, the sort! algorithm is really simple, and I can replicate it here:

]add https://github.com/xiaodaigh/SortingLab.jl
using SortingLab
using Base.Threads: @spawn
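function another_sort!(df, col)
    # Compute the sorting permutation once, then apply it to every column.
    @time ordering = fsortperm(df[!, col])
    channel_lock = Channel{Bool}(length(names(df)))
    for c in names(df)
        @spawn begin
            v = df[!, c]
            # Write back in place; `v = v[ordering]` would only rebind the local name.
            @inbounds v .= v[ordering]
            put!(channel_lock, true)
        end
    end
    # Wait until every column task has signalled completion.
    for _ in names(df)
        take!(channel_lock)
    end
    df
end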

@time another_sort!(df, :int); # fsortperm takes ~10s; total 12s~18s
@time another_sort!(df, :str); # fsortperm takes ~10s; total 12s~18s

You can see that (f)sortperm takes 10s. So using a more optimised sortperm like SortingLab.fsortperm already gives much better results.
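
To make the two steps concrete, here is a minimal sketch using only Base Julia. The toy table `cols` and its values are made up for illustration; it is a NamedTuple of vectors standing in for a DataFrame, and `sortperm` stands in for `fsortperm`.

```julia
# Toy table: a NamedTuple of equal-length column vectors (hypothetical data).
cols = (int = [3, 1, 2], float = [0.3, 0.1, 0.2], str = ["c", "a", "b"])

# Step 1: compute the permutation that sorts the key column.
ordering = sortperm(cols.int)

# Step 2: apply that same permutation to every column.
sorted_cols = map(v -> v[ordering], cols)
```

Step 1 is the expensive part for a large table, which is why the choice of sortperm dominates the total time.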

The solution seems to be finding a more efficient sortperm. Adapting SortingLab.fsortperm would be a good start.
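
For a sense of what a faster sortperm could look like, below is a rough sketch of an LSD radix sortperm for Int32 keys. This is an assumption-laden illustration, not SortingLab's actual implementation: `radix_sortperm` is a hypothetical name, it handles only Int32, it is single-threaded, and it has not been benchmarked against the numbers above.

```julia
# LSD radix sortperm for Int32 keys (illustrative sketch only).
function radix_sortperm(x::AbstractVector{Int32})
    n = length(x)
    # Flip the sign bit so that unsigned order matches signed order.
    ukeys = [reinterpret(UInt32, k) ⊻ 0x80000000 for k in x]
    perm = collect(1:n)
    buf = similar(perm)
    counts = zeros(Int, 256)
    # Process one byte per pass, least significant byte first.
    for shift in 0:8:24
        fill!(counts, 0)
        # Histogram of the current byte.
        for i in perm
            counts[((ukeys[i] >> shift) & 0xff) + 1] += 1
        end
        # Exclusive prefix sum: first output slot of each bucket.
        pos = 1
        for b in 1:256
            c = counts[b]
            counts[b] = pos
            pos += c
        end
        # Stable scatter of indices into the buffer.
        for i in perm
            b = ((ukeys[i] >> shift) & 0xff) + 1
            buf[counts[b]] = i
            counts[b] += 1
        end
        perm, buf = buf, perm
    end
    return perm
end

x = Int32[3, -1, 2, -1]
p = radix_sortperm(x)
```

Because every pass is stable, ties keep their original order, so the result should agree with Base's default `sortperm` on the same input.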
