Faster sorting of DataFrames.jl via background ordering

xiaodai · August 17, 2020, 8:18am

I have had this idea for a long time, now I finally gave it a go and the results are amazing. So here’s the idea.

Here sorting a large dataframe the algorithm roughly goes like this

Perform a sortperm on the columns to sort by. This returns an idx of indices.
Consider colX which his not a sort by column then colX[idx] would give order the colX in the right order.
Repeat number 2 for all columns.

There are room for improvement for 1 and 2 but 3 is low-hanging fruit.

Once the dataframe is sorted, most users wouldn’t start using all columns in the dataframe. For example, printing to REPL only prints the few columns. Here is our opportunity and idea

Why not just return the dataframe and re-order the columns in the background.

If the user requests a certain column then wait for that column to be sorted and return the sorted column.

Below is a very crude implementation of that idea using ThreadPools.jl (which allows background threads).

But see the “performance” difference is stark!

If you did read the source you may be wondering, if someone uses the column before they have finished sorting that would cause trouble. True, but that’s I haven’t implemented the full idea which is replaced each column with a InProgressVector and when you access the InProgressVector it will be waiting for the sort to finish and return the sorted Vector AND it will replace itself in the DataFrame with the properly sorted vector. I haven’t implemented those yet, but you get the idea.

using ThreadPools: @bthreads
using DataFrames
using SortingLab: sorttwo!

# function barrier for assigining values
assignnew!(v, idx) = v .= v[idx]

function fsort2!(dataframe, by)
    idx = sortperm(dataframe[!, by])

    @bthreads for n in names(dataframe)
        assignnew!(dataframe[!, n], idx)
    end
    dataframe
end

# faster sort of dataframe
function fsort!(dataframe, by)
    _, idx = sorttwo!(dataframe[!, by], collect(1:nrow(dataframe)))

    all_except_by = setdiff(names(dataframe), [by])
     @bthreads for n in all_except_by
        assignnew!(dataframe[!, n], idx)
    end
    dataframe
end

# generate a dataframe with 300 columns
@time df = DataFrame([rand(Int, 1_000_000) for i = 1:300])
sort_time = @elapsed sort!(df, "x1") # takes 14 seconds

@time df = DataFrame([rand(Int, 1_000_000) for i = 1:300])
fsort2_time = @elapsed fsort2!(df, "x1") # takes 1 seconds

# sorting
@time df = DataFrame([rand(Int, 1_000_000) for i = 1:300])
fsort_time = @elapsed fsort!(df, "x1") # takes 1 seconds

using Plots
plot(
    ["DataFrames.jl sort", "New Sort", "New Sort with SortingLab.jl"], 
    [sort_time, fsort2_time, fsort_time]; title = "Sorting DataFrames", seriestype=:bar)

Topic		Replies	Views
ANN: DataConvience.jl v0.1.4 - faster sorting for DataFrames Package Announcements sort , dataframes	6	888	June 26, 2020
Any way to speed up sorting a dataframe? A more efficient sortperm would be great General Usage sort , sortperm , dataframes	2	939	June 25, 2020
Sorting seems to have some low hanging speed fruit for sorting by single column Data performance , sort , dataframes	7	938	December 6, 2017
Issorted for Dataframe rows Performance	5	356	September 8, 2020
Sort DataFrame by the greater of multiple columns New to Julia question , sort , dataframes	5	508	January 4, 2023

Faster sorting of DataFrames.jl via background ordering

Related topics