I was doing some benchmarking of data.table vs DataFrames.jl. Of course, data.table is still way faster on sorting. But I found a low hanging fruit for sorting performance; it could be the backbone of a PR. Here is an MWE: basically I found that if there is only one column in the cols argument of sort, then I can simply do a sortperm on the one column vector and then apply to the rest of the Dataframe for a 4x speed up; this is implemented in fsort. Btw, this is still 10x slower than data.table so there must be other efficiencies we can find.
using DataFrames
const N = Int(1e8)
testdf = DataFrame(large_n_grps = rand(1:Int32(N/100), N), small_n_grps = rand(1:100, N), v1 = rand(1:5, N))
function fsort(df::DataFrame, cols)
x = df[cols]
df[sortperm(x),:]
end
@time fsort(testdf, :small_n_grps) #10.5 seconds
@time sort(testdf, cols = [:small_n_grps]) # 40.4 seconds
Thanks for continuing your investigations. However, I’m not sure it’s a good idea to improve speed for special cases, especially since you note that the improvement is still not enough to get close to data.table. I’d rather try to improve the general algorithm.
Indeed it’s not clear to me why an algorithm accepting multiple columns should be slower when passed a single column than your specialized algorithm. The overhead due to looping over the sorting columns (in this case, a single column) should be negligible. There may be low-hanging fruits in the general algorithm too.
Not sure if it is related, but DataFrames sort! uses Base.permute!! (which is permute! minus a copy) and when looking for it I get the following:
help?> permute!
search: permute! ipermute! permutedims! permute permutedims PermutedDimsArray
permute!(v, p)
Permute vector v in-place, according to permutation p. No checking is done
to verify that p is a permutation.
To return a new permutation, use v[p]. Note that this is generally faster
than permute!(v,p) for large vectors.
Instead IndexedTables does something similar but instead of permute!(v, p) they use:
copy!(v, v[p])
I don’t know if this is the cause of the performance difference but maybe it’s a relevant thing to keep in mind.
I’ve tested the implementation in this PR in IndexedTables: it is more or less as fast as yours in this specific case and it indeed does very similar things. The surprising part is that most of the time is actually spent in:
df[p, :]
That is to say, getting the rearreanged version of the vectors. I really don’t understand how data.table could speed that up so much. I would expect that v[p] with v a vector and p a vector of integers is reasonably fast in Julia. Do you know what data.table’s algorithm does differently?