Faster sorting of DataFrames.jl via background ordering

I have had this idea for a long time, now I finally gave it a go and the results are amazing. So here’s the idea.

Here sorting a large dataframe the algorithm roughly goes like this

  1. Perform a sortperm on the columns to sort by. This returns an idx of indices.
  2. Consider colX which his not a sort by column then colX[idx] would give order the colX in the right order.
  3. Repeat number 2 for all columns.

There are room for improvement for 1 and 2 but 3 is low-hanging fruit.

Once the dataframe is sorted, most users wouldn’t start using all columns in the dataframe. For example, printing to REPL only prints the few columns. Here is our opportunity and idea

Why not just return the dataframe and re-order the columns in the background.

If the user requests a certain column then wait for that column to be sorted and return the sorted column.

Below is a very crude implementation of that idea using ThreadPools.jl (which allows background threads).

But see the “performance” difference is stark!

image

If you did read the source you may be wondering, if someone uses the column before they have finished sorting that would cause trouble. True, but that’s I haven’t implemented the full idea which is replaced each column with a InProgressVector and when you access the InProgressVector it will be waiting for the sort to finish and return the sorted Vector AND it will replace itself in the DataFrame with the properly sorted vector. I haven’t implemented those yet, but you get the idea.

using ThreadPools: @bthreads
using DataFrames
using SortingLab: sorttwo!

# function barrier for assigining values
assignnew!(v, idx) = v .= v[idx]

function fsort2!(dataframe, by)
    idx = sortperm(dataframe[!, by])

    @bthreads for n in names(dataframe)
        assignnew!(dataframe[!, n], idx)
    end
    dataframe
end

# faster sort of dataframe
function fsort!(dataframe, by)
    _, idx = sorttwo!(dataframe[!, by], collect(1:nrow(dataframe)))

    all_except_by = setdiff(names(dataframe), [by])
     @bthreads for n in all_except_by
        assignnew!(dataframe[!, n], idx)
    end
    dataframe
end

# generate a dataframe with 300 columns
@time df = DataFrame([rand(Int, 1_000_000) for i = 1:300])
sort_time = @elapsed sort!(df, "x1") # takes 14 seconds

@time df = DataFrame([rand(Int, 1_000_000) for i = 1:300])
fsort2_time = @elapsed fsort2!(df, "x1") # takes 1 seconds

# sorting
@time df = DataFrame([rand(Int, 1_000_000) for i = 1:300])
fsort_time = @elapsed fsort!(df, "x1") # takes 1 seconds

using Plots
plot(
    ["DataFrames.jl sort", "New Sort", "New Sort with SortingLab.jl"], 
    [sort_time, fsort2_time, fsort_time]; title = "Sorting DataFrames", seriestype=:bar)
4 Likes