ANN: DataConvience.jl v0.1.4 - faster sorting for DataFrames

In the latest version of DataConvience.jl you get functions fsort and fsort! for faster sorting of DataFrames. Sorry currently only ascending order sort is possible. Test out the speed difference for yourself! For me I get 8x speed up for some of tests.

sort_vs_fsort

using DataConvenience, DataFrames
using Random
M = 100_000_000
str_base = [randstring(8) for i in 1:1_000_000]
df = DataFrame(int = rand(Int32, M), float=rand(M), str = rand(str_base, M))

time1 =@elapsed df1 = sort(df, :int);
time2 =@elapsed df2 = fsort(df, :int);
df1 == df2
df != df2

time3 =@elapsed df1 = sort(df, :str);
time4 =@elapsed df2 = fsort(df, :str);
df1 == df2
df != df2

time5 =@elapsed df1 = sort(df, [:str, :float]);
time6 =@elapsed df2 = fsort(df, [:str, :float]);
df1 == df2
df != df2

using Plots
using StatsPlots
groupedbar(
    repeat(["sort Int","sort String", "sort String and Float64"], inner = 2),
    [time1, time2, time3, time4, time5, time6],
    group = repeat(["`sort`", "`fsort`"], outer=3),
    title = "Sort 100m rows: `sort` vs `fsort`")
savefig("benchmarks/sort_vs_fsort.png")

Ideally, contributing a faster sortperm back to base or SortingAlgorithms.jl would be key to performance. However, until that happens you have DataConvenience.fsort.

4 Likes

I’m surprised that there’s still so much more room for improvement for simple sorting of vectors. I thought sorting was such a well studied problem that Base would already have optimal implementations of sort/sortperm.

What’s holding up the replacement of Base’s implementations with your fsort/fsortperm? Sorting is so ubiquitous that this seems like a big performance win for everyone. Are there corner cases or any downsides?

The issue is not sort but sortperm. I think DataFrames.jl will get much faster for non-string once this progresses further: https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/33

For strings though, we still need SortingLab.jl as only SortingLab.jl has a faster string sort or when this goes in https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/27

See related issues https://github.com/JuliaLang/julia/pull/24772

And https://github.com/JuliaLang/julia/issues/24770

1 Like

My comment was not based on the performance of DataFrame sorting, but rather on a quick test I had done of Base’s default sort and your fsort.

After further testing, I now see that the downside of RadixSort, as used by fsort, is that it’s performance doesn’t improve when sorting an array that’s already or mostly sorted, but the default sorting in Base does.

It’s also good to see that median and quantile have performance in line with RadixSort. So yeah, Base’s sort is fine, but one should be aware that RadixSort will greatly improve performance on mostly unsorted arrays.

julia> x = rand(10^7);

julia> @btime sort($x);
  672.342 ms (2 allocations: 76.29 MiB)

julia> @btime sort($x, alg=RadixSort);
  269.020 ms (18 allocations: 152.78 MiB)

julia> sx = sort(x);

julia> @btime sort($sx);
  128.143 ms (2 allocations: 76.29 MiB)

julia> @btime sort($sx, alg=RadixSort);
  266.842 ms (18 allocations: 152.78 MiB)

julia> @btime median($x);
  163.452 ms (2 allocations: 76.29 MiB)

julia> @btime quantile($x, 0.5);
  163.203 ms (2 allocations: 76.29 MiB)

Radix sort uses double the memory. Radixsort for int and float is already available in SortingAlgorithms.jl as u saw

IIRC, both SortingAlgorithms.jl and SortingLab.jl use LSD radix sort. Does somebody compare against MSD radix sort?

My naive guess is that MSD is better than LSD for variable length elements like strings and vectors and also work nicely with task-based parallelism. I also wonder how parallel multi-pivot quick sort compare with parallel radix sort. (In fact, I wrote parallel MSD radix sort and meant to release it. But I haven’t have time to polish it…)

Parallel MSD. I had experiment with MSD pre Julia 1.3 and I didn’t understand alot of concepts like allocations etc then. So it was much slower. I wonder if MSD using threading would work better

1 Like