Any way to speed up sorting a dataframe? A more efficient sortperm would be great

xiaodai · June 24, 2020, 4:01pm

Consider a DataFrame:

using DataFrames
using Random: randstring
M = 100_000_000
str_base = [randstring(8) for i in 1:1_000_000]
df = DataFrame(int = rand(Int32, M), float=rand(M), str = rand(str_base, M))

@time sort!(df, :int); 
# 80s on my machine

@time sort!(df, :str); 
# 170s on my machine

using CSV
CSV.write("tmp.csv", df)

The same operations using R’s data.table take about 3s for the integer column and 25s for the string column:

library(data.table)
df = fread("tmp.csv")
setkey(df, "int") 
# 3s 

setkey(df, "str") 
# 25s

Based on this, data.table is still much faster.

Now, the sort! algorithm is really simple, and I can replicate it here with a faster sortperm:

]add https://github.com/xiaodaigh/SortingLab.jl
using SortingLab
using Base.Threads: @spawn
function another_sort!(df, col)
    @time ordering = fsortperm(df[!, col])
    channel_lock = Channel{Bool}(length(names(df)))
    for c in names(df)
        @spawn begin
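            v = df[!, c]
            # Write the reordered values back into the column in place;
            # a plain `v = v[ordering]` would only rebind the local variable.
            @inbounds v .= v[ordering]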
            put!(channel_lock, true)
        end
    end
    for _ in names(df)
        take!(channel_lock)
    end
    df
end

@time another_sort!(df, :int); # sortperm is 10s total 12s~18s
@time another_sort!(df, :str); # sortperm is 10s total 12s~18s

You can see that (f)sortperm takes 10s, out of a total of 12s~18s. So using a more optimised sortperm like SortingLab.fsortperm already gives much better results than the 80s and 170s above.
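
For anyone who wants to try the two-step idea (compute an ordering once, then reorder every column in its own task) without installing anything, here is a minimal sketch using only Base. The name permute_columns! and the NamedTuple of vectors standing in for a DataFrame are my own illustrative choices, and Base's sortperm is used where a faster one such as fsortperm could be dropped in.

```julia
using Base.Threads: @spawn

# Hypothetical helper: `cols` is a NamedTuple of equal-length vectors
# standing in for the columns of a DataFrame.
function permute_columns!(cols::NamedTuple, key::Symbol)
    # Step 1: compute the ordering once from the key column.
    # Base.sortperm is used here; a faster sortperm could replace it.
    ordering = sortperm(cols[key])
    # Step 2: reorder each column in its own task and wait for all of them.
    @sync for v in values(cols)
        # v[ordering] gathers into a temporary vector, which is then
        # copied back into v, so the column is mutated in place.
        @spawn v .= v[ordering]
    end
    return cols
end

cols = (int = Int32[3, 1, 2], float = [0.3, 0.1, 0.2], str = ["c", "a", "b"])
permute_columns!(cols, :int)
```

Each task touches a different vector, so no locking is needed; the @sync block plays the role of the channel in the function above.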

The solution seems to be finding a more efficient sortperm. Adapting SortingLab.fsortperm would be a good start.
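
To illustrate what a more efficient sortperm could look like for fixed-width integer keys, below is a rough sketch of a least-significant-digit radix sortperm for Int32. This is only an assumption about the kind of technique that helps; it is not SortingLab's code, and the name radix_sortperm is made up for this example.

```julia
# Sketch of an LSD radix sortperm for Int32 keys (hypothetical helper,
# not taken from SortingLab or DataFrames).
function radix_sortperm(keys::Vector{Int32})
    n = length(keys)
    # Flip the sign bit so that unsigned order matches signed order.
    ukeys = [xor(reinterpret(UInt32, k), 0x80000000) for k in keys]
    perm = collect(1:n)
    tmp = similar(perm)
    for shift in 0:8:24                      # four passes, one byte per pass
        counts = zeros(Int, 256)
        # Histogram of the current byte.
        for i in perm
            counts[((ukeys[i] >> shift) & 0xff) + 1] += 1
        end
        # Turn counts into starting offsets (exclusive prefix sum).
        offset = 1
        for b in 1:256
            c = counts[b]
            counts[b] = offset
            offset += c
        end
        # Stable scatter of the indices into their buckets.
        for i in perm
            b = ((ukeys[i] >> shift) & 0xff) + 1
            tmp[counts[b]] = i
            counts[b] += 1
        end
        perm, tmp = tmp, perm
    end
    return perm
end

x = Int32[5, -1, 3, -7, 3, 0]
p = radix_sortperm(x)
```

Because every pass is stable, the result should agree with Base's default sortperm, which makes it easy to check against before benchmarking.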

Tamas_Papp · June 25, 2020, 9:12am

Since this is suggesting a very specific improvement to a package, opening a pull request or at least an issue there might be the best way to start a discussion about this.

xiaodai · June 25, 2020, 9:55am

Already discussed on Slack.
