I was doing some benchmarking of data.table vs DataFrames.jl. Of course, data.table is still way faster on sorting. But I found some low-hanging fruit for sorting performance that could be the backbone of a PR. Here is an MWE: if there is only one column in the cols argument of sort, I can simply run sortperm on that one column vector and then apply the resulting permutation to the rest of the DataFrame for a 4x speedup; this is implemented in fsort. Btw, this is still 10x slower than data.table, so there must be other efficiencies we can find.

    using DataFrames

    const N = Int(1e8)
    testdf = DataFrame(large_n_grps = rand(1:Int32(N/100), N),
                       small_n_grps = rand(1:100, N),
                       v1 = rand(1:5, N))

    function fsort(df::DataFrame, cols)
        x = df[cols]
        df[sortperm(x), :]
    end

    @time fsort(testdf, :small_n_grps)            # 10.5 seconds
    @time sort(testdf, cols = [:small_n_grps])    # 40.4 seconds
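
For anyone who wants to sanity-check the idea on a small table, here is a minimal sketch of the same single-column trick. It assumes DataFrames.jl 1.x indexing (`df[!, col]` and positional `sort(df, col)`), which is not the syntax used in the MWE, and `fsort_single` is a hypothetical name. The table is deliberately tiny, so it only checks correctness and says nothing about the timings.

```julia
using DataFrames
using Random

# Hypothetical helper; assumes DataFrames.jl 1.x indexing (df[!, col]).
function fsort_single(df::DataFrame, col::Symbol)
    perm = sortperm(df[!, col])  # permutation that sorts the single key column
    return df[perm, :]           # reorder all columns with that permutation (makes a copy)
end

Random.seed!(1)
n = 10_000  # small on purpose; the benchmark in the post used 1e8 rows
small = DataFrame(small_n_grps = rand(1:100, n), v1 = rand(1:5, n))

sorted = fsort_single(small, :small_n_grps)
reference = sort(small, :small_n_grps)  # built-in sort, for comparison
```

The comparison is only on the key column, since that is all the trick has to guarantee; row order within equal keys depends on whether both paths use a stable sort.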