WIP: faster string sort

I’d be happy with a 2x-5x performance improvement for sorting! The cover photo shows that radix sort is faster than what Julia has in Base. My test cases were 100m records: R can sort them in 20 secs, but it used to take 60 secs in Julia. Now it takes 20~30 seconds in some cases. Julia can appear slow at data manipulation tasks (vs R’s data.table), so this is just one of many steps towards better performance.

Also, O(N log N) is the bound for comparison-based sorting, and O(wN) is the bound for radix sort, where w is the number of passes (6 for 64-bit data, given SortingAlgorithms sorts 11 bits at a time); radix sort is one of the best known non-comparison-based sorts. Happy to find out about algorithms that can beat this and get a 5x+ performance increase.
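For reference, the pass count is just ceil(bits / radix size); a quick check for 64-bit keys:

cld(64, 11) # 6 passes with SortingAlgorithms’ 11-bit radix
cld(64, 16) # 4 passes with a 16-bit radix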

For radix sort you have to focus on low-level details. I did mention some high-level optimizations that might be possible after nalimilan’s comments, though; I have also touched on similar themes to Chris’ posts, e.g. this.

1 Like

I also think that the most typical use case is a PooledArray (some CSV readers, like TextParser, even load this kind of string data into a PooledArray by default). It would also be interesting to see what the performance is in that case (which no longer has anything to do with strings; it should use the efficient encoding that a PooledArray provides).
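A rough sketch of what that could look like, assuming the public refs/pool fields of PooledArrays.jl (pooled_sort is a made-up name, not part of any package): sort the small pool once, then counting-sort the integer refs.

using PooledArrays

function pooled_sort(pa::PooledArray)
    ord  = sortperm(pa.pool)              # sort the K unique values, K << N
    rank = invperm(ord)                   # rank[j] = sorted position of pa.pool[j]
    counts = zeros(Int, length(pa.pool))
    for r in pa.refs                      # count occurrences of each pooled value
        counts[rank[r]] += 1
    end
    out = similar(pa.pool, length(pa))    # write the output in sorted order
    i = 1
    for k in 1:length(ord)
        v = pa.pool[ord[k]]
        for _ in 1:counts[k]
            out[i] = v
            i += 1
        end
    end
    return out
end

pa = PooledArray(rand(["id$i" for i in 1:1000], 1_000_000))
issorted(pooled_sort(pa)) # true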

2 Likes

Perhaps I was not clear: the question is in what context would I want to sort N \gg K values per se, of which only K are unique?

In the contexts I use sorting, optimizing this is solving the wrong kind of problem. E.g., if I want to sort a table with such strings as part of a lexicographic ordering, I would aim to make string comparisons fast, perhaps by interning them in some kind of a structure which achieves fast comparisons that sort can call. Above some threshold of N/K, my intuition would be that programmer effort is better spent thinking about the context, instead of on low-level optimizations.

As an example, consider

using SortingAlgorithms
using DataStructures
idstrvec = rand("id" .* dec.(1:1_000_000, 10), 100_000_000) # 100m ids drawn from 1m unique strings
sort(idstrvec[1:10])            # compile
@time sort(idstrvec)
#  80.119106 seconds (88 allocations: 1.118 GiB, 5.72% gc time)

"Return `key=>count` pairs for elements in the argument sorted by the `key`."
function sorted_counts(v)
    c = counter(v)
    sort(collect(c.map), by = first)
end

sorted_counts(idstrvec[1:10])   # compile
@time sorted_counts(idstrvec)
# 49.105053 seconds (1.00 M allocations: 114.762 MiB, 0.35% gc time)

where a few minutes of effort with standard tools yields a 2x improvement, just by compressing the vector.

Think about sortperm, which relies on a fast underlying sorting algorithm. You have two vectors of the same length, e.g. two columns in a dataframe: one is a string, e.g. the member_id, and the second is the amount of money spent by the member in one transaction. The member may have spent money multiple times, so the strings can have repeats. Say you wanted to understand the median, q25, and q75 spend of every member, without using an approximation. You can sortperm on the member_id and apply the returned indexes to the transactions, and then compute the stats. Very common stuff in data manipulation. There are better algorithms for median, q25, q75 that don’t need sorting, so maybe it’s not a perfect example, but once the data is sorted you can apply an arbitrary function to all values belonging to the same group; see the sketch below.
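A minimal sketch of that workflow (member_id, amount, and group_quantiles are made-up names for illustration, not code from the post): sortperm on the string ids, permute the amounts, then summarise each contiguous run of equal ids.

using Statistics # Julia >= 0.7; on 0.6 median/quantile live in Base

function group_quantiles(member_id::Vector{String}, amount::Vector{Float64})
    p = sortperm(member_id)             # a fast string sortperm is the bottleneck here
    ids, amts = member_id[p], amount[p]
    stats = Dict{String,NTuple{3,Float64}}()
    lo = 1
    for i in 2:length(ids) + 1
        # close a group when the id changes (or at the end of the vector)
        if i > length(ids) || ids[i] != ids[lo]
            g = view(amts, lo:i-1)
            stats[ids[lo]] = (quantile(g, 0.25), median(g), quantile(g, 0.75))
            lo = i
        end
    end
    return stats
end

member_id = rand(["id$i" for i in 1:1000], 100_000)   # repeated member ids
amount    = 100 .* rand(100_000)                      # spend per transaction
group_quantiles(member_id, amount)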

Anyway, the post is about an implementation of string radix sort in Julia, which wasn’t available before, and as shown it leads to faster sorting. Sorting is generally useful, with repeated values or not.

There are more specialised sorting algorithms (e.g. counting sort) that can apply to CategoricalArrays and PooledArrays. Those will be useful as well when implemented.

Before going to sleep I tried these two ideas, expecting nothing but…

String sorting performance vs R is only marginally (within 1.x-1.3x) slower now! In my benchmarks, sorting 10m records where 100k of them are distinct yields 1.31 secs (Julia) vs 1.1 secs (R); for 100m it’s around 25-27 seconds vs 18-21 seconds.

[benchmark plot: string sort performance, 10m records with 100k unique]

I made these 3 changes

  1. Use my own sort function instead of Base.sort, which is hard to overload as Base.sort doesn’t dispatch on alg, or more generally on named arguments. Somehow Base.sort(svec, alg = StringRadixSort) is running some code which probably wasn’t all that necessary.
  2. Change SortingAlgorithms.RADIXSIZE from 11 to 16. This means sorting 16 bits in one go instead of 11 bits. After getting comfortable with radix sort, I was curious why SortingAlgorithms.jl chose 11 as the RADIXSIZE; the most common types are probably multiples of 16 bits, e.g. UInt32 is 32 bits and UInt64 is 64 bits. Sorting 11 bits in one go, we need 3 passes over the data to sort a 32-bit type and 6 passes for a 64-bit type, vs 2 and 4 passes if we choose a RADIXSIZE of 16 bits. From benchmarking, increasing the RADIXSIZE results in faster sorting too.
  3. Do not load the underlying bits into a UInt128 and instead use UInt64 as the maximum. I found that it’s quicker to sort twice using UInt64 instead of sorting once using UInt128. When I say sort “once”, the data has already been gone through 4 times when sorting the underlying bytes as 64-bit; so the number of passes over the data is the same whether you use UInt128 or UInt64, but it seems Julia handles 64-bit data better, which makes sense given that I am running on a 64-bit machine (see the sketch after this list).
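To illustrate point 3, here is a minimal sketch of treating 128-bit keys as two 64-bit sorts; it uses Base’s stable sortperm for clarity (the real code radix-sorts each half, and sort_uint128_via_uint64 is just an illustrative name):

function sort_uint128_via_uint64(v::Vector{UInt128})
    lo = UInt64.(v .& typemax(UInt64))   # low 64 bits of each key
    hi = UInt64.(v .>> 64)               # high 64 bits of each key
    p  = sortperm(lo)                    # first pass: least significant half
    p  = p[sortperm(hi[p])]              # second pass: high half; ties keep the low-half order
    return v[p]
end

v = rand(UInt128, 1_000_000)
sort_uint128_via_uint64(v) == sort(v) # true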
9 Likes

Wow! So you matched data.table’s performance without relying on string interning? I guess that means they actually don’t use it at all?

It would be interesting to investigate whether something should be improved in Base.sort.

There seem to be some very old issues about sort performance:

https://github.com/JuliaLang/julia/issues/939

Maybe this radixsort could be of use more generally?

I think you’d need to do something to get it to use the SSE2 or AVX registers for 128 bits to be faster. I was thinking of using UInt128 for some of my new string code, but the generated code used multiple 64-bit registers instead of one SSE2 register; I only looked at a few functions, though.
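In case it helps, one quick way to check what the compiler emits for UInt128 arithmetic is to look at the native code; on x86-64 a 128-bit add usually lowers to paired 64-bit add/adc instructions rather than one SSE2 register operation (hypothetical snippet, not the string code mentioned above):

using InteractiveUtils # Julia >= 0.7; @code_native is in Base on 0.6

a, b = rand(UInt128, 2)
@code_native a + b # inspect the generated machine code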

The [radix sort] implementation is orders of magnitude faster than shell sort for character vectors, in part thanks to clever use of the internal CHARSXP table.

That is from R’s documentation on sort. I think they are using it somehow.

There is still a gap, but the difference looks small on 10m strings. I think there is still a 20%-30% performance gap.

Yeah, R uses radix sort as the default for numbers and factors. I actually found the sorting algorithm in v0.6.2 to be very fast for integers and floats. Strings and sortperm are a bit slow.

One issue is that I think strings use a merge sort by default. Just trying quicksort might help a lot. That could easily be made the default. If a sufficiently generic radix sort for builtin types exists, we could make it the default as well. Not a top priority right now but a good post-1.0 project.

For the case where the number of strings >> the number of unique strings, how costly would it be to pass over the string vector counting how many instances there are of each unique string, sort the unique strings, and then replicate each by its frequency of occurrence?

1 Like

This would require a hash table if you don’t assume prior knowledge. What you described is the classic counting sort. From my own testing, it was slower.

Radix sort uses an array to keep the counts instead of a hash table. It recognizes that you can’t have an array of 2^64 counts, so it works in stages, holding a 2^n array of counts with n < 64 and doing multiple passes over the data.
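For concreteness, here is a minimal sketch of a least-significant-digit radix sort over UInt64 keys with a 2^16 array of counts per pass (illustrative only, not the SortingAlgorithms.jl implementation):

const RADIXSIZE = 16
const RADIXMASK = (UInt64(1) << RADIXSIZE) - one(UInt64)

function radix_pass!(dst::Vector{UInt64}, src::Vector{UInt64}, pass::Int)
    shift = (pass - 1) * RADIXSIZE
    counts = zeros(Int, 1 << RADIXSIZE)   # 2^16 buckets, not 2^64
    for x in src                          # first sweep: histogram of this digit
        counts[((x >> shift) & RADIXMASK) + 1] += 1
    end
    total = 0                             # exclusive prefix sum -> bucket start offsets
    for i in eachindex(counts)
        c = counts[i]; counts[i] = total; total += c
    end
    for x in src                          # second sweep: stable scatter into dst
        b = ((x >> shift) & RADIXMASK) + 1
        counts[b] += 1
        dst[counts[b]] = x
    end
    return dst
end

function radixsort_uint64(v::Vector{UInt64})
    src, dst = copy(v), similar(v)
    for pass in 1:cld(64, RADIXSIZE)      # 4 passes of 16 bits each
        radix_pass!(dst, src, pass)
        src, dst = dst, src               # ping-pong between the two buffers
    end
    return src
end

issorted(radixsort_uint64(rand(UInt64, 1_000_000))) # true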

2 Likes

A faster sort for strings has been published as part of SortingLab.jl. The function is radixsort.

Example code below:

Pkg.add("SortingLab") # run once
using SortingLab
sv = [randstring(8) for i=1:Int(1e6)];
svs = radixsort(sv) # faster than sort
svsp = fsortperm(sv) .|> Int # faster than sortperm
issorted(sv[svsp])

My PR to SortingAlgorithms.jl has hit a little bit of a snag, hence this new package - SortingLab.jl; but I do intend to use SortingLab.jl as more or less a playground for new ideas. Mature ideas that work I will try to contribute back to Base or SortingAlgorithms.

8 Likes

Nice!

How does it do with strings of length 16 and with varying lengths 4…16?

Still twice as fast.

using SortingLab

# varying lengths 4 to 16
x = [randstring(rand(4:16)) for i = 1:1_000_000]
radixsort(x[1:1]) # compile
sort(x[1:1])      # compile
@time radixsort(x) # 0.4s
@time sort(x)      # 0.89s

# fixed length 16
x = [randstring(16) for i = 1:1_000_000]
radixsort(x[1:1]) # compile
sort(x[1:1])      # compile
@time radixsort(x)
@time sort(x)
@time sort(x, alg=QuickSort)

If the strings are just 15 bytes or shorter, then you can use ShortStrings.jl (unregistered as of 8th Feb 2018, but being registered) for even faster sorting.

using ShortStrings

x = [randstring(rand(4:15)) for i = 1:1_000_000]
sx = ShortString15.(x)      # pack each string (≤ 15 bytes) into a ShortString15
ShortStrings.fsort(sx[1:1]) # compile
sort(sx[1:1])               # compile
@time ShortStrings.fsort(sx) #0.23s
@time sort(sx) #1.27s

Do you expect 8-bit chars? The Char type is 32 bits … how is this handled?