I have had this idea for a long time, now I finally gave it a go and the results are amazing. So here’s the idea.
Here sorting a large dataframe the algorithm roughly goes like this
- Perform a
sortpermon the columns to sort by. This returns an
colXwhich his not a sort by column then
colX[idx]would give order the
colXin the right order.
- Repeat number 2 for all columns.
There are room for improvement for 1 and 2 but 3 is low-hanging fruit.
Once the dataframe is sorted, most users wouldn’t start using all columns in the dataframe. For example, printing to REPL only prints the few columns. Here is our opportunity and idea
Why not just return the dataframe and re-order the columns in the background.
If the user requests a certain column then wait for that column to be sorted and return the sorted column.
Below is a very crude implementation of that idea using ThreadPools.jl (which allows background threads).
But see the “performance” difference is stark!
If you did read the source you may be wondering, if someone uses the column before they have finished sorting that would cause trouble. True, but that’s I haven’t implemented the full idea which is replaced each column with a
InProgressVector and when you access the
InProgressVector it will be waiting for the sort to finish and return the sorted
Vector AND it will replace itself in the DataFrame with the properly sorted vector. I haven’t implemented those yet, but you get the idea.
using ThreadPools: @bthreads using DataFrames using SortingLab: sorttwo! # function barrier for assigining values assignnew!(v, idx) = v .= v[idx] function fsort2!(dataframe, by) idx = sortperm(dataframe[!, by]) @bthreads for n in names(dataframe) assignnew!(dataframe[!, n], idx) end dataframe end # faster sort of dataframe function fsort!(dataframe, by) _, idx = sorttwo!(dataframe[!, by], collect(1:nrow(dataframe))) all_except_by = setdiff(names(dataframe), [by]) @bthreads for n in all_except_by assignnew!(dataframe[!, n], idx) end dataframe end # generate a dataframe with 300 columns @time df = DataFrame([rand(Int, 1_000_000) for i = 1:300]) sort_time = @elapsed sort!(df, "x1") # takes 14 seconds @time df = DataFrame([rand(Int, 1_000_000) for i = 1:300]) fsort2_time = @elapsed fsort2!(df, "x1") # takes 1 seconds # sorting @time df = DataFrame([rand(Int, 1_000_000) for i = 1:300]) fsort_time = @elapsed fsort!(df, "x1") # takes 1 seconds using Plots plot( ["DataFrames.jl sort", "New Sort", "New Sort with SortingLab.jl"], [sort_time, fsort2_time, fsort_time]; title = "Sorting DataFrames", seriestype=:bar)