ANN: DataConvience.jl v0.1.4 - faster sorting for DataFrames

xiaodai · June 25, 2020, 11:58am

In the latest version of DataConvience.jl you get functions fsort and fsort! for faster sorting of DataFrames. Sorry currently only ascending order sort is possible. Test out the speed difference for yourself! For me I get 8x speed up for some of tests.

sort_vs_fsort

using DataConvenience, DataFrames
using Random
M = 100_000_000
str_base = [randstring(8) for i in 1:1_000_000]
df = DataFrame(int = rand(Int32, M), float=rand(M), str = rand(str_base, M))

time1 =@elapsed df1 = sort(df, :int);
time2 =@elapsed df2 = fsort(df, :int);
df1 == df2
df != df2

time3 =@elapsed df1 = sort(df, :str);
time4 =@elapsed df2 = fsort(df, :str);
df1 == df2
df != df2

time5 =@elapsed df1 = sort(df, [:str, :float]);
time6 =@elapsed df2 = fsort(df, [:str, :float]);
df1 == df2
df != df2

using Plots
using StatsPlots
groupedbar(
    repeat(["sort Int","sort String", "sort String and Float64"], inner = 2),
    [time1, time2, time3, time4, time5, time6],
    group = repeat(["`sort`", "`fsort`"], outer=3),
    title = "Sort 100m rows: `sort` vs `fsort`")
savefig("benchmarks/sort_vs_fsort.png")

Ideally, contributing a faster sortperm back to base or SortingAlgorithms.jl would be key to performance. However, until that happens you have DataConvenience.fsort.

robsmith11 · June 25, 2020, 2:58pm

I’m surprised that there’s still so much more room for improvement for simple sorting of vectors. I thought sorting was such a well studied problem that Base would already have optimal implementations of sort/sortperm.

What’s holding up the replacement of Base’s implementations with your fsort/fsortperm? Sorting is so ubiquitous that this seems like a big performance win for everyone. Are there corner cases or any downsides?

xiaodai · June 26, 2020, 4:13am

The issue is not sort but sortperm. I think DataFrames.jl will get much faster for non-string once this progresses further: https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/33

For strings though, we still need SortingLab.jl as only SortingLab.jl has a faster string sort or when this goes in https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/27

See related issues https://github.com/JuliaLang/julia/pull/24772

And https://github.com/JuliaLang/julia/issues/24770

robsmith11 · June 26, 2020, 6:23am

My comment was not based on the performance of DataFrame sorting, but rather on a quick test I had done of Base’s default sort and your fsort.

After further testing, I now see that the downside of RadixSort, as used by fsort, is that it’s performance doesn’t improve when sorting an array that’s already or mostly sorted, but the default sorting in Base does.

It’s also good to see that median and quantile have performance in line with RadixSort. So yeah, Base’s sort is fine, but one should be aware that RadixSort will greatly improve performance on mostly unsorted arrays.

julia> x = rand(10^7);

julia> @btime sort($x);
  672.342 ms (2 allocations: 76.29 MiB)

julia> @btime sort($x, alg=RadixSort);
  269.020 ms (18 allocations: 152.78 MiB)

julia> sx = sort(x);

julia> @btime sort($sx);
  128.143 ms (2 allocations: 76.29 MiB)

julia> @btime sort($sx, alg=RadixSort);
  266.842 ms (18 allocations: 152.78 MiB)

julia> @btime median($x);
  163.452 ms (2 allocations: 76.29 MiB)

julia> @btime quantile($x, 0.5);
  163.203 ms (2 allocations: 76.29 MiB)

xiaodai · June 26, 2020, 6:42am

Radix sort uses double the memory. Radixsort for int and float is already available in SortingAlgorithms.jl as u saw

tkf · June 26, 2020, 8:48pm

IIRC, both SortingAlgorithms.jl and SortingLab.jl use LSD radix sort. Does somebody compare against MSD radix sort?

My naive guess is that MSD is better than LSD for variable length elements like strings and vectors and also work nicely with task-based parallelism. I also wonder how parallel multi-pivot quick sort compare with parallel radix sort. (In fact, I wrote parallel MSD radix sort and meant to release it. But I haven’t have time to polish it…)

xiaodai · June 26, 2020, 10:29pm

Parallel MSD. I had experiment with MSD pre Julia 1.3 and I didn’t understand alot of concepts like allocations etc then. So it was much slower. I wonder if MSD using threading would work better

Topic		Replies	Views
Any way to speed up sorting a dataframe? A more efficient sortperm would be great General Usage sort , sortperm , dataframes	2	947	June 25, 2020
Sorting seems to have some low hanging speed fruit for sorting by single column Data performance , sort , dataframes	7	939	December 6, 2017
Faster sorting of DataFrames.jl via background ordering Performance sort , dataframes	0	1423	August 17, 2020
DataFrame sort Performance using Query.jl vs SAS PROC SORT Performance query , sort , dataframes	19	1530	August 11, 2023
Issorted for Dataframe rows Performance	5	357	September 8, 2020

ANN: DataConvience.jl v0.1.4 - faster sorting for DataFrames

Related topics