Sorting seems to have some low-hanging speed fruit when sorting by a single column

I was doing some benchmarking of data.table vs DataFrames.jl. Of course, data.table is still way faster at sorting, but I found some low-hanging fruit for sorting performance that could be the backbone of a PR. Here is an MWE: if there is only one column in the cols argument of sort, I can simply run sortperm on that one column vector and then apply the resulting permutation to the rest of the DataFrame for a 4x speed-up; this is implemented in fsort below. Btw, this is still 10x slower than data.table, so there must be other efficiencies we can find.


using DataFrames

const N = Int(1e8)
testdf = DataFrame(large_n_grps = rand(1:Int32(N/100), N), small_n_grps = rand(1:100, N), v1 = rand(1:5, N))

function fsort(df::DataFrame, cols)
    x = df[cols]              # extract the single sorting column as a vector
    df[sortperm(x), :]        # compute the permutation once, then apply it to every column
end

@time fsort(testdf, :small_n_grps) #10.5 seconds
@time sort(testdf, cols = [:small_n_grps]) # 40.4 seconds
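
A quick sanity check (just a sketch with a made-up small DataFrame and a hypothetical :dummy column) that the shortcut orders the key column the same way as the built-in sort:

# compare the key column after both sorts; other columns may differ in tie order
small = DataFrame(a = rand(1:10, 1_000), dummy = rand(1_000))
@assert fsort(small, :a)[:a] == sort(small, cols = [:a])[:a]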

Thanks for continuing your investigations. However, I’m not sure it’s a good idea to improve speed for special cases, especially since you note that the improvement is still not enough to get close to data.table. I’d rather try to improve the general algorithm.

Indeed it's not clear to me why an algorithm accepting multiple columns should be slower when passed a single column than your specialized algorithm. The overhead due to looping over the sorting columns (in this case, a single column) should be negligible. There may be low-hanging fruit in the general algorithm too.
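
For what it's worth, here is a minimal sketch (not DataFrames' actual implementation) of how a generic "sort the row indices through a comparator that indexes into the column" path can lose to a plain sortperm on the vector, even with a single sorting column:

v = rand(1:100, 10^7)

# direct path: sortperm specializes on the concrete vector
@time p1 = sortperm(v)

# generic path: sort row indices with an lt that indexes into the column on
# every comparison (roughly what a lexicographic multi-column sort must do)
@time p2 = sort(collect(1:length(v)), lt = (i, j) -> v[i] < v[j])

@assert v[p1] == v[p2]

If the gap in the MWE comes from something like this, closing it would help the multi-column case as well.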


Maybe that's my Christmas project if no one beats me to it.


@time sort(testdf, cols = :small_n_grps)

instead of

@time sort(testdf, cols = [:small_n_grps]) # 40.4 seconds

recovers some performance on my machine.

Not sure if it is related, but DataFrames' sort! uses Base.permute!! (which is permute! minus a copy), and looking it up in the help gives the following:

help?> permute!
search: permute! ipermute! permutedims! permute permutedims PermutedDimsArray

  permute!(v, p)

  Permute vector v in-place, according to permutation p. No checking is done
  to verify that p is a permutation.

  To return a new permutation, use v[p]. Note that this is generally faster
  than permute!(v,p) for large vectors.

IndexedTables does something similar, but instead of permute!(v, p) it uses:

copy!(v, v[p])

I don’t know if this is the cause of the performance difference but maybe it’s a relevant thing to keep in mind.
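
A rough way to compare the two strategies in isolation (just a sketch; numbers will vary with machine and vector size):

v = rand(1:100, 10^7)
p = sortperm(v)

w1 = copy(v)
@time permute!(w1, p)     # in-place, cycle-following (permute! copies p internally; permute!! skips that copy)

w2 = copy(v)
@time copy!(w2, w2[p])    # materialize w2[p] with getindex, then copy it back (IndexedTables-style)

@assert w1 == w2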


The documentation on IndexedTables is a bit sparse atm. Will look into it once
it's more stable.

I’ve tested the implementation in this PR in IndexedTables: it is more or less as fast as yours in this specific case and it indeed does very similar things. The surprising part is that most of the time is actually spent in:

df[p, :]

That is to say, getting the rearranged version of the vectors. I really don't understand how data.table could speed that up so much. I would expect that v[p] with v a vector and p a vector of integers is reasonably fast in Julia. Do you know what data.table's algorithm does differently?
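
For reference, here is roughly how I looked at where the time goes (a sketch; df[p, :] should boil down to a per-column getindex like this, though it may not be exactly what DataFrames does internally):

p = sortperm(testdf[:small_n_grps])

@time testdf[p, :]                            # full row permutation of the DataFrame
@time [testdf[c][p] for c in names(testdf)]   # roughly the same work, done column by column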

Perhaps it's df[p,:] that's inefficient? R's data.table is formidably optimised in many cases, I find, but I am not sure exactly what's going on.

I tried sorting by :large_n_grps instead, and data.table's advantage is now only 2x.

using DataFrames

const N = Int(1e8)
testdf = DataFrame(large_n_grps = rand(1:Int32(N/100), N), small_n_grps = rand(1:100, N), v1 = rand(1:5, N))

function fsort(df::DataFrame, cols)
    x = df[cols]
    df[sortperm(x),:]
end

@time fsort(testdf, :large_n_grps)
@time sort(testdf, cols = [:large_n_grps])