WIP: faster string sort

I’d be happy with a 2x-5x performance improvement for sorting! The cover photo shows that radix sort is faster than what Julia has in Base. My test cases were 100m records: R can sort them in 20 secs, but it used to take 60 secs in Julia. Now it takes 20~30 seconds in some cases. Julia can appear slow at data manipulation tasks (vs R’s data.table), so this is just one of many steps towards better performance.

Also, O(N log N) is the bound for comparison-based sorting, and O(wN) is the bound for radix sort, where w is the number of passes (6 for 64-bit data, given SortingAlgorithms sorts 11 bits at a time); radix sort is one of the best known non-comparison-based sorts. Happy to find out about algorithms that can beat this and get a 5x+ performance increase.
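For reference, the pass count is just ceil(bits / radix size); a quick check for 64-bit keys:

cld(64, 11) # 6 passes with SortingAlgorithms’ 11-bit radix
cld(64, 16) # 4 passes with a 16-bit radix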

For radix sort you have to focus on low-level details. I did mention some high-level optimizations that might be possible after nalimilan’s comments, though; I have also touched on similar themes to Chris’ posts, e.g. this.

1 Like

I also think that the most typical use case is a PooledArray (some CSV readers, like TextParser, even load this kind of string data into a PooledArray by default). It would also be interesting to see what the performance is in that case (which no longer has anything to do with strings; it should use the efficient encoding that a PooledArray provides).
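A rough sketch of what that could look like, assuming the public refs/pool fields of PooledArrays.jl (pooled_sort is a made-up name, not part of any package): sort the small pool once, then counting-sort the integer refs.

using PooledArrays

function pooled_sort(pa::PooledArray)
    ord  = sortperm(pa.pool)              # sort the K unique values, K << N
    rank = invperm(ord)                   # rank[j] = sorted position of pa.pool[j]
    counts = zeros(Int, length(pa.pool))
    for r in pa.refs                      # count occurrences of each pooled value
        counts[rank[r]] += 1
    end
    out = similar(pa.pool, length(pa))    # write the output in sorted order
    i = 1
    for k in 1:length(ord)
        v = pa.pool[ord[k]]
        for _ in 1:counts[k]
            out[i] = v
            i += 1
        end
    end
    return out
end

pa = PooledArray(rand(["id$i" for i in 1:1000], 1_000_000))
issorted(pooled_sort(pa)) # true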

2 Likes

Perhaps I was not clear: the question is in what context would I want to sort N \gg K values per se, of which only K are unique?

In the contexts I use sorting, optimizing this is solving the wrong kind of problem. E.g., if I want to sort a table with such strings as part of a lexicographic ordering, I would aim to make string comparisons fast, perhaps by interning them in some kind of a structure which achieves fast comparisons that sort can call. Above some threshold of N/K, my intuition would be that programmer effort is better spent thinking about the context, instead of on low-level optimizations.

As an example, consider

using SortingAlgorithms
using DataStructures
idstrvec = rand("id" .* dec.(1:1_000_000, 10), 100_000_000) # 100m ids drawn from 1m unique strings
sort(idstrvec[1:10])            # compile
@time sort(idstrvec)
#  80.119106 seconds (88 allocations: 1.118 GiB, 5.72% gc time)

"Return `key=>count` pairs for elements in the argument sorted by the `key`."
function sorted_counts(v)
    c = counter(v)
    sort(collect(c.map), by = first)
end

sorted_counts(idstrvec[1:10])   # compile
@time sorted_counts(idstrvec)
# 49.105053 seconds (1.00 M allocations: 114.762 MiB, 0.35% gc time)

where a few minutes of effort with standard tools yields a 2x improvement, just by compressing the vector.

Think about sortperm, which relies on a fast underlying sorting algorithm. You have two vectors of the same length, e.g. two columns in a dataframe: one is a string, e.g. the member_id, and the second is the amount of money spent by the member in one transaction. The member may have spent money multiple times, so the strings can have repeats. Say you wanted to understand the median, q25, and q75 spend of every member, without using an approximation. You can sortperm on the member_id and apply the returned indexes to the transactions, and then compute the stats. Very common stuff in data manipulation. There are better algorithms for median, q25, q75 that don’t need sorting, so maybe it’s not a perfect example, but once the data is sorted you can apply an arbitrary function to all values belonging to the same group; see the sketch below.
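A minimal sketch of that workflow (member_id, amount, and group_quantiles are made-up names for illustration, not code from the post): sortperm on the string ids, permute the amounts, then summarise each contiguous run of equal ids.

using Statistics # Julia >= 0.7; on 0.6 median/quantile live in Base

function group_quantiles(member_id::Vector{String}, amount::Vector{Float64})
    p = sortperm(member_id)             # a fast string sortperm is the bottleneck here
    ids, amts = member_id[p], amount[p]
    stats = Dict{String,NTuple{3,Float64}}()
    lo = 1
    for i in 2:length(ids) + 1
        # close a group when the id changes (or at the end of the vector)
        if i > length(ids) || ids[i] != ids[lo]
            g = view(amts, lo:i-1)
            stats[ids[lo]] = (quantile(g, 0.25), median(g), quantile(g, 0.75))
            lo = i
        end
    end
    return stats
end

member_id = rand(["id$i" for i in 1:1000], 100_000)   # repeated member ids
amount    = 100 .* rand(100_000)                      # spend per transaction
group_quantiles(member_id, amount)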

Anyway, the post is about an implementation of string radix sort in Julia, which wasn’t available before, and as shown it leads to faster sorting. Sorting is generally useful, with repeated values or not.

There are more specialised sorting algorithms (e.g. counting sort) that can apply to CategoricalArrays and PooledArrays. Those will be useful as well when implemented.

Before going to sleep I tried these two ideas, expecting nothing but…

String sorting performance vs R is only marginally (within 1.x-1.3x) slower now! In my benchmarks, sorting 10m records where 100k of them are distinct yields 1.31 secs (Julia) vs 1.1 secs (R); for 100m it’s around 25-27 seconds vs 18-21 seconds.

[benchmark plot: string sort performance, 10m records with 100k unique]

I made these 3 changes

  1. Use my own sort function instead of Base.sort, which is hard to overload as Base.sort doesn’t dispatch on alg, or more generally on named arguments. Somehow Base.sort(svec, alg = StringRadixSort) is running some code which probably wasn’t all that necessary.
  2. Change SortingAlgorithms.RADIXSIZE from 11 to 16. This means sorting 16 bits in one go instead of 11 bits. After getting comfortable with radix sort, I was curious why SortingAlgorithms.jl chose 11 as the RADIXSIZE; the most common types are probably multiples of 16 bits, e.g. UInt32 is 32 bits and UInt64 is 64 bits. Sorting 11 bits in one go, we need 3 passes over the data to sort a 32-bit type and 6 passes for a 64-bit type, vs 2 and 4 passes if we choose a RADIXSIZE of 16 bits. From benchmarking, increasing the RADIXSIZE results in faster sorting too.
  3. Do not load the underlying bits into a UInt128 and instead use UInt64 as the maximum. I found that it’s quicker to sort twice using UInt64 instead of sorting once using UInt128. When I say sort “once”, the data has already been gone through 4 times when sorting the underlying bytes as 64-bit; so the number of passes over the data is the same whether you use UInt128 or UInt64, but it seems Julia handles 64-bit data better, which makes sense given that I am running on a 64-bit machine (see the sketch after this list).
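To illustrate point 3, here is a minimal sketch of treating 128-bit keys as two 64-bit sorts; it uses Base’s stable sortperm for clarity (the real code radix-sorts each half, and sort_uint128_via_uint64 is just an illustrative name):

function sort_uint128_via_uint64(v::Vector{UInt128})
    lo = UInt64.(v .& typemax(UInt64))   # low 64 bits of each key
    hi = UInt64.(v .>> 64)               # high 64 bits of each key
    p  = sortperm(lo)                    # first pass: least significant half
    p  = p[sortperm(hi[p])]              # second pass: high half; ties keep the low-half order
    return v[p]
end

v = rand(UInt128, 1_000_000)
sort_uint128_via_uint64(v) == sort(v) # true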
9 Likes

Wow! So you matched data.table’s performance without relying on string interning? I guess that means they actually don’t use it at all?

It would be interesting to investigate whether something should be improved in Base.sort.

There seem to be some very old issues about sort performance:

https://github.com/JuliaLang/julia/issues/939

Maybe this radixsort could be of use more generally?

I think you’d need to do something to get it to use the SSE2 or AVX registers for 128 bits to be faster. I was thinking of using UInt128 for some of my new string code, but the generated code used multiple 64-bit registers instead of one SSE2 register; I only looked at a few functions, though.
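In case it helps, one quick way to check what the compiler emits for UInt128 arithmetic is to look at the native code; on x86-64 a 128-bit add usually lowers to paired 64-bit add/adc instructions rather than one SSE2 register operation (hypothetical snippet, not the string code mentioned above):

using InteractiveUtils # Julia >= 0.7; @code_native is in Base on 0.6

a, b = rand(UInt128, 2)
@code_native a + b # inspect the generated machine code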

The [radix sort] implementation is orders of magnitude faster than shell sort for character vectors, in part thanks to clever use of the internal CHARSXP table.

That is from R’s documentation on sort. I think they are using it somehow.

There is still a gap, but the difference looks small on 10m strings. I think there is still a 20%-30% performance gap.

Yeah, R uses radix sort as the default for numbers and factors. I actually found the sorting algorithm in v0.6.2 to be very fast for integers and floats. Strings and sortperm are a bit slow.

One issue is that I think strings use a merge sort by default. Just trying quicksort might help a lot. That could easily be made the default. If a sufficiently generic radix sort for builtin types exists, we could make it the default as well. Not a top priority right now but a good post-1.0 project.

For the case where the number of strings >> the number of unique strings, how costly would it be to pass over the string vector counting how many instances there are of each unique string, sort the unique strings, and then replicate each by its frequency of occurrence?

1 Like

This would require a hash table if you don’t assume prior knowledge. What you described is the classic counting sort. From my own testing, it was slower.

Radix sort uses an array to keep the counts instead of a hash table. It recognizes that you can’t have an array of 2^64 counts, so it works in stages, holding a 2^n array of counts with n < 64 and doing multiple passes over the data.
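For concreteness, here is a minimal sketch of a least-significant-digit radix sort over UInt64 keys with a 2^16 array of counts per pass (illustrative only, not the SortingAlgorithms.jl implementation):

const RADIXSIZE = 16
const RADIXMASK = (UInt64(1) << RADIXSIZE) - one(UInt64)

function radix_pass!(dst::Vector{UInt64}, src::Vector{UInt64}, pass::Int)
    shift = (pass - 1) * RADIXSIZE
    counts = zeros(Int, 1 << RADIXSIZE)   # 2^16 buckets, not 2^64
    for x in src                          # first sweep: histogram of this digit
        counts[((x >> shift) & RADIXMASK) + 1] += 1
    end
    total = 0                             # exclusive prefix sum -> bucket start offsets
    for i in eachindex(counts)
        c = counts[i]; counts[i] = total; total += c
    end
    for x in src                          # second sweep: stable scatter into dst
        b = ((x >> shift) & RADIXMASK) + 1
        counts[b] += 1
        dst[counts[b]] = x
    end
    return dst
end

function radixsort_uint64(v::Vector{UInt64})
    src, dst = copy(v), similar(v)
    for pass in 1:cld(64, RADIXSIZE)      # 4 passes of 16 bits each
        radix_pass!(dst, src, pass)
        src, dst = dst, src               # ping-pong between the two buffers
    end
    return src
end

issorted(radixsort_uint64(rand(UInt64, 1_000_000))) # true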

2 Likes

A faster sort for strings has been published as part of SortingLab.jl. The function is radixsort.

Example code below:

Pkg.add("SortingLab") # run once
using SortingLab
sv = [randstring(8) for i=1:Int(1e6)];
svs = radixsort(sv) # faster than sort
svsp = fsortperm(sv) .|> Int # faster than sortperm
issorted(sv[svsp])

My PR to SortingAlgorithms.jl has hit a little bit of a snag, hence this new package - SortingLab.jl; but I do intend to use SortingLab.jl as more or less a playground for new ideas. Mature ideas that work I will try to contribute back to Base or SortingAlgorithms.

8 Likes

Nice!

How does it do with strings of length 16 and with varying lengths 4…16?

Still twice as fast.

using SortingLab

# varying lengths 4 to 16
x = [randstring(rand(4:16)) for i = 1:1_000_000]
radixsort(x[1:1]) # compile
sort(x[1:1])      # compile
@time radixsort(x) # 0.4s
@time sort(x)      # 0.89s

# fixed length 16
x = [randstring(16) for i = 1:1_000_000]
radixsort(x[1:1]) # compile
sort(x[1:1])      # compile
@time radixsort(x)
@time sort(x)
@time sort(x, alg=QuickSort)

If the strings are just 15 bytes or shorter, then you can use ShortStrings.jl (unregistered as of 8th Feb 2018, but being registered) for even faster sorting.

using ShortStrings

x = [randstring(rand(4:15)) for i = 1:1_000_000]
sx = ShortString15.(x)      # pack each string (≤ 15 bytes) into a ShortString15
ShortStrings.fsort(sx[1:1]) # compile
sort(sx[1:1])               # compile
@time ShortStrings.fsort(sx) #0.23s
@time sort(sx) #1.27s

Do you expect 8-bit chars? The Char type is 32 bits … how is this handled?