Sorting seems to have some low hanging speed fruit for sorting by single column

xiaodai · December 6, 2017, 1:18pm

I was doing some benchmarking of data.table vs DataFrames.jl. Of course, data.table is still way faster on sorting. But I found a low hanging fruit for sorting performance; it could be the backbone of a PR. Here is an MWE: basically I found that if there is only one column in the cols argument of sort, then I can simply do a sortperm on the one column vector and then apply to the rest of the Dataframe for a 4x speed up; this is implemented in fsort. Btw, this is still 10x slower than data.table so there must be other efficiencies we can find.


using DataFrames

const N = Int(1e8)
testdf = DataFrame(large_n_grps = rand(1:Int32(N/100), N), small_n_grps = rand(1:100, N), v1 = rand(1:5, N))

function fsort(df::DataFrame, cols)
    x = df[cols]
    df[sortperm(x),:]
end

@time fsort(testdf, :small_n_grps) #10.5 seconds
@time sort(testdf, cols = [:small_n_grps]) # 40.4 seconds

nalimilan · December 6, 2017, 1:53pm

Thanks for continuing your investigations. However, I’m not sure it’s a good idea to improve speed for special cases, especially since you note that the improvement is still not enough to get close to data.table. I’d rather try to improve the general algorithm.

Indeed it’s not clear to me why an algorithm accepting multiple columns should be slower when passed a single column than your specialized algorithm. The overhead due to looping over the sorting columns (in this case, a single column) should be negligible. There may be low-hanging fruits in the general algorithm too.

xiaodai · December 6, 2017, 1:54pm

Maybe that’s my christmas project if no one beats me to it.

Dan · December 6, 2017, 3:05pm

@time sort(testdf, cols = :small_n_grps)

instead of

@time sort(testdf, cols = [:small_n_grps]) # 40.4 seconds

recovers some performance on my machine.

piever · December 6, 2017, 3:42pm

Not sure if it is related, but DataFrames sort! uses Base.permute!! (which is permute! minus a copy) and when looking for it I get the following:

help?> permute!
search: permute! ipermute! permutedims! permute permutedims PermutedDimsArray

  permute!(v, p)

  Permute vector v in-place, according to permutation p. No checking is done
  to verify that p is a permutation.

  To return a new permutation, use v[p]. Note that this is generally faster
  than permute!(v,p) for large vectors.

Instead IndexedTables does something similar but instead of permute!(v, p) they use:

copy!(v, v[p])

I don’t know if this is the cause of the performance difference but maybe it’s a relevant thing to keep in mind.

xiaodai · December 6, 2017, 8:06pm

The documenter on indexedtable is a bit sparse atm. Will look into it once
its more stable.

piever · December 6, 2017, 8:44pm

I’ve tested the implementation in this PR in IndexedTables: it is more or less as fast as yours in this specific case and it indeed does very similar things. The surprising part is that most of the time is actually spent in:

df[p, :]

That is to say, getting the rearreanged version of the vectors. I really don’t understand how data.table could speed that up so much. I would expect that v[p] with v a vector and p a vector of integers is reasonably fast in Julia. Do you know what data.table’s algorithm does differently?

xiaodai · December 6, 2017, 10:21pm

Perhaps it’s df[p,:] that’s inefficient? R’s data.table is formidably optimised in many cases I find but I am not sure exactly what’s going.

I tried to do the group by :large_n_grps and data.table advantage is now only 2x.

using DataFrames

const N = Int(1e8)
testdf = DataFrame(large_n_grps = rand(1:Int32(N/100), N), small_n_grps = rand(1:100, N), v1 = rand(1:5, N))

function fsort(df::DataFrame, cols)
    x = df[cols]
    df[sortperm(x),:]
end

@time fsort(testdf, :large_n_grps)
@time sort(testdf, cols = [:large_n_grps])

Topic		Replies	Views
Any way to speed up sorting a dataframe? A more efficient sortperm would be great General Usage sort , sortperm , dataframes	2	948	June 25, 2020
ANN: DataConvience.jl v0.1.4 - faster sorting for DataFrames Package Announcements sort , dataframes	6	891	June 26, 2020
Faster sorting of DataFrames.jl via background ordering Performance sort , dataframes	0	1423	August 17, 2020
Sort DataFrame by the greater of multiple columns New to Julia question , sort , dataframes	5	523	January 4, 2023
Issorted for Dataframe rows Performance	5	357	September 8, 2020

Sorting seems to have some low hanging speed fruit for sorting by single column

Related topics