WIP: faster string sort

R’s Internals documentation discusses how R stores the strings and they do use some kind of “cache/look up” table for their strings; probably to save on RAM usage. So it’s not really fair to compare to the R implementation yet.

Two potential advantages are lower RAM usage, and very fast comparison because strcmp becomes a pointer comparison rather than character-by-character. The trade-off is that every string construction then requires a hash table look up.

The computer science name for this technique is string interning. Julia’s Symbol type is interned like that, and I believe I’ve seen posts about (ab)using it for general string interning – but I’m not sure if such a use-case is intended :smile:.
See also: [ANN] InternedStrings.jl: Allocate strings once and reuse them

Use bswap(value), it usually ends up being a single instruction (rolw, bswapl, bswapq)

there is also ntoh for endiness conversion. i will test them out

Those are just synonyms for bswap, such that if you use hton, on a little-endian host, it will convert it to network order (i.e. bigendian), but on a big-endian host, it’s a no-op.
You should use hton not ntoh though, just to keep the semantics correct, even though functions end up being the same on all current machines.

It will be really good if I can get some help with the following posts. It will help me create a faster pure Julia string sorting algorithm

Assign values in user-defined primitive type

bswap_int for user-defined primitive type

I have made some improvements to string sorting and have made a PR to SortingAlgorithms.jl

See

https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/27

2 Likes

I will take a rest on string sorting now and focus on string-grouping instead.

2 Likes

First draft of the blog post about string sorting

https://www.codementor.io/zhuojiadai/draft/f57tf9f8s

3 Likes

Great!

My two cents: I’d mention the fact that R uses string interning directly when presenting the results. People may not read the full post, or may just skip it, in which case they will not understand the full story. Also, interning strings requires an additional computation step that is not accounted for, so this is not a completely fair comparison. For example, if you computed the time required to read a CSV file and sort it, R’s advantage would probably be less striking.

When you mention InternedStrings.jl, you could also point to PooledArrays and CategoricalArrays, which can be useful alternatives when you have a small proportion of unique values. So while it may seem at first that Julia is lacking compared to R because it does not intern strings, you have actually three different choices depending on your needs in Julia. OTC in R you cannot opt out of string interning if it’s not useful for your use case.

7 Likes

You can also use Julia’s Symbol type if you need to have something interned (although from what I understand, those will never get garbage collected).

1 Like

Great point. Added.

Added a side note. Also R and pands also has categorical arrays.

I am sure some useRs will start picking a fight about Julia package load speeds and do a comparison. Fun to add though.

I may be missing something about the exercise, but if I have N unique strings among K \gg N values, what’s the point of sorting the K values? Wouldn’t it be better to just sort the N unique values?

Also, if you are thinking about interning and want repeated access to the sorted values, consider sorted containers.

My impression is that you are focusing on low-level optimizations, which is occasionally useful, but rarely gives more than 2-5x improvements in an already optimized language like Julia. Thinking about the broad algorithm usually has higher payoffs; @ChrisRackauckas recently wrote a nice blog post about this.

2 Likes

I’d happy with 2x-5x performance improvement for sorting! The cover photo shows radix sort is faster than what Julia has in Base. I used to test cases were 100m records and R can sort them in 20sec but used to take 60 secs in Julia. Now it take 20~30 seconds in some cases. Julia can appear slow at data manipulation tasks (vs R’s data.table) so this is just one of many steps towards better performance.

Also O(NlogN) is a bound for comparison based sorting and O(wN) is the sorting speed bound for radixsort where w is the number of passes (6 for 64bit data given SortingAlgorithm sorts 11bits at once), and is one of the best known non-comparison based sorts. Happy to find out algorithms that can beat this and get 5x+ performance increase.

For radix sort you have to focus on low-level. I did mention high level optimization that might be possible after nalimilan’s comments though; also I have touched on similar theme’s to Chris’ posts e.g. this.

1 Like

I also think that the most typical use case is a PooledArray (some CSV readers, like TextParser, even load this kind of string data into a PooledArray by default). It would also be interesting to see what’s the performance in that case (which no longer has anything to do with strings, it should use the efficient encoding that a PooledArray provides).

2 Likes

Perhaps I was not clear: the question is in what context would I want to sort N \gg K values per se, of which only K are unique?

In the contexts I use sorting, optimizing this is solving the wrong kind of problem. Eg if I want to sort a table with such strings as part of a lexicographic ordering, I would aim to make string comparisons fast, perhaps by interning them in some kind of a structure which achieves fast comparison that sort can call. Above some threshold of N/K, my intuition would be that programmer effort is better spent thinking about the context, instead of low-level optimizations.

As an example, consider

using SortingAlgorithms
using DataStructures
idstrvec = rand("id".*dec.(1:1_000_000,10), 100_000_000)
sort(idstrvec[1:10])            # compile
@time sort(idstrvec)
#  80.119106 seconds (88 allocations: 1.118 GiB, 5.72% gc time)

"Return `key=>count` pairs for elements in the argument sorted by the `key`."
function sorted_counts(v)
    c = counter(v)
    sort(collect(c.map), by = first)
end

sorted_counts(idstrvec[1:10])   # compile
@time sorted_counts(idstrvec)
# 49.105053 seconds (1.00 M allocations: 114.762 MiB, 0.35% gc time)

where a few minutes of effort with standard tools yields a 2x improvement, just by compressing the vector.

Think about sortperm which relies on a fast underlying sorting algorithm. You have two vectors, one is string e.g. the member_id and the second the amount of money spent by the member in one transaction; two vectors are of the same length, e.g. two columns in a dataframe. The member may have spent money multiple times, so the string can have repeats. Say you wanted to understand, say, the median, q25, q75 spend of every member, without using an approximation. You can sortperm on the customer’s member_id and apply the returned indexes to the transactions, and then compute the stats. Very common stuff in data manipulation. There are better algorithms for median, q25, q75, that doesn’t need sorting, so maybe not a perfect example, but once you sort them you can then apply an arbitrary function on all values belonging to the same group.

anyway the post is about an implementation of string radix sort in Julia which wasn’t available, and as shown leads to faster sorting. Sorting is generally useful, with repeated values or not.

there are more specialised sorting algorithms (counting sort) that can apply to categoricalarrays and pooledarrays. those will be useful as well when implemented.

Before going to sleep I tried these two ideas, expecting nothing but…

String sorting performance vs R is only marginally (within 1.x-1.3x) slower now! In my benchmarks sorting 10m records where 100k of them are distinct yields 1.31secs (Julia) vs 1.1secs (R) (for 100m it’s around 25-27 seconds vs 18 -21 seconds).

sort_perf_10m100_u

I made these 3 changes

  1. Use my own sort function instead of Base.sort which is hard to overload as Base.sort doesn’t dispatch on alg, or more generally on named arguments. Some how Base.sort(svec, alg = StringRadixSort) is running some code which probably wasn’t all that necessary.
  2. Change the SortingAlgorithms.RADIXSIZE to 16 from 11. This means sort 16 bits in one go instead of 11 bits. After getting comfortable with radixsort, I was curious why SortingAlgorithms.jl chose 11 as the RADIXSIZE, the most common types are probably multiples of 16 bits, e.g. UInt32 is 32 bits and UInt64 is 64 bits. Sorting 11 bits in one go, we need 3 passes over the data to sort 32 bit type and 6 passes over the data for a 64 bit type vs 2 and 4 passes if we just choose a RADIXSIZE of 16 bits. From benchmarking, increasing the RADIXSIZE results in faster sorting too.
  3. Do not load underlying bits into UInt128 and instead use UInt64 as the maximum. I found that it’s quicker to sort twice using UInt64 instead of sorting once UInt128. When I said sort “once” the data has been gone through 4 times when sorting the underlying bytes as 64bit; so really the number of passes over the data is the same whether you use UInt128 or UInt64 but it seems Julia can handle 64bit data better which makes sense given that I am running on a 64bit machine.
9 Likes