I have had this idea for a long time, now I finally gave it a go and the results are amazing. So here’s the idea.
Here sorting a large dataframe the algorithm roughly goes like this
- Perform a
sortperm
on the columns to sort by. This returns anidx
of indices. - Consider
colX
which his not a sort by column thencolX[idx]
would give order thecolX
in the right order. - Repeat number 2 for all columns.
There are room for improvement for 1 and 2 but 3 is low-hanging fruit.
Once the dataframe is sorted, most users wouldn’t start using all columns in the dataframe. For example, printing to REPL only prints the few columns. Here is our opportunity and idea
Why not just return the dataframe and re-order the columns in the background.
If the user requests a certain column then wait for that column to be sorted and return the sorted column.
Below is a very crude implementation of that idea using ThreadPools.jl (which allows background threads).
But see the “performance” difference is stark!
If you did read the source you may be wondering, if someone uses the column before they have finished sorting that would cause trouble. True, but that’s I haven’t implemented the full idea which is replaced each column with a InProgressVector
and when you access the InProgressVector
it will be waiting for the sort to finish and return the sorted Vector
AND it will replace itself in the DataFrame with the properly sorted vector. I haven’t implemented those yet, but you get the idea.
using ThreadPools: @bthreads
using DataFrames
using SortingLab: sorttwo!
# function barrier for assigining values
assignnew!(v, idx) = v .= v[idx]
function fsort2!(dataframe, by)
idx = sortperm(dataframe[!, by])
@bthreads for n in names(dataframe)
assignnew!(dataframe[!, n], idx)
end
dataframe
end
# faster sort of dataframe
function fsort!(dataframe, by)
_, idx = sorttwo!(dataframe[!, by], collect(1:nrow(dataframe)))
all_except_by = setdiff(names(dataframe), [by])
@bthreads for n in all_except_by
assignnew!(dataframe[!, n], idx)
end
dataframe
end
# generate a dataframe with 300 columns
@time df = DataFrame([rand(Int, 1_000_000) for i = 1:300])
sort_time = @elapsed sort!(df, "x1") # takes 14 seconds
@time df = DataFrame([rand(Int, 1_000_000) for i = 1:300])
fsort2_time = @elapsed fsort2!(df, "x1") # takes 1 seconds
# sorting
@time df = DataFrame([rand(Int, 1_000_000) for i = 1:300])
fsort_time = @elapsed fsort!(df, "x1") # takes 1 seconds
using Plots
plot(
["DataFrames.jl sort", "New Sort", "New Sort with SortingLab.jl"],
[sort_time, fsort2_time, fsort_time]; title = "Sorting DataFrames", seriestype=:bar)