Progress towards faster `sortperm` for Strings

performance
sortperm

#1

I have just implemented an idea I had to make sortperm faster for Strings. It is implemented in SortingLab.jl's fsortperm; and I got the performance on a particular benchmark to be 7x faster than Base.sortperm, see code here

Here is a rough sketch of how it works

  1. For each value in a string vector, load the last 4 bytes of the string into x::UInt64 where the last 32 bits are all 0 and only the first 32 bits are populated with 4 bytes of the string
  2. Populate the last 32 bits of x with the string’s position in the array
  3. use a modified LSD radix sort that only sorts on the first 32 bits; but note that the position of the string is also being permuted at the same time
  4. Obtain the last 4 bytes of each of the elements in the sorted x's and store them in ss; note that ss is the original index position permuted
  5. Create a view of the string vector permuted by ss, i.e @view(orig_string_vector[ss])
  6. repeat from step 1, but instead of loading the last 4 bytes from the string vector, load the next 4 bytes (counting in reverse) in the string vector view from step 5, and populate the last 32 bits of each x with the corresponding ss values; this is the iterative sorting step in LSD radixsort

Basically, I found that it’s less expensive to pack 4 bytes of a string and its array position into one 64 bit value and sort that instead of sorting the strings and keeping another vector of permuted indexes (as in this SortingAlgorithms’s unmerged PR).

However, R’s string performance is another total beast due to data.table’s optimization ideas and clever user of the global cache and string interning. But Julia’s getting closer even without string interning.

Slow sorting performance in Julia has been documented before. But I feel that my approach is too low-level to be a long term solution, but it’s a worthy experiment.


#2

String interning only makes sense if there’s repeated values though, right? So this chart is missing a crucial detail that you’re sampling in a way that has each string repeated about 100 times, but the 7x and chart don’t make reference to the case it’s in. But that’s kind of misleading since it’s not about a 7x, it’s about whether this particular way of optimizing makes sense in most cases, and what case that is.

The better question is, how does this performance difference very and the change of repeats is changed? I would assume from these details that your Julia implementation does better when everything is unique, and somewhere between 0 and 100 there’s a cutoff where it begins to make sense to intern. That would be a nice line chart, change of uniqueness vs relative performance against R. If the cutoff is like 50, then maybe it’s something that only makes sense to assume is better on certain data applications (which R was designed for, while Julia was not necessarily designed to be just for data), requiring an alternative method for this specific application. On the other hand, if the cutoff is low enough then maybe it’s a good thing to always do by default? Just spitballing but you see where I am heading.

Also:

Is that thread-safe?


#3

Also string interning incurs a cost when creating strings which may or may not be worth it depending on the number of times you perform grouping/sorting on the data. To take this into account one would need to include the time needed to create strings from raw bytes, or to read a text file.

In the previous thread you showed a graph where Julia was as fast as R. What’s the difference with the new graph? Is it just sort vs. sortperm?


#4

At some point I played around with a hybrid representation where strings up to 15 bytes were stored inline with one byte of length in such a way that they integer sort in correct string sort order (store the length in the low byte, pad string data with zeros) and as a pointer to string data otherwise. This not only means that short strings can fit in a single register but also that when comparing two short strings the compare is just an integer comparison. This flew for sorting and basically eliminates any advantage of interning (since interning still requires a pointer to the interned string, which is strictly more storage and more indirection). The problem was that making garbage collection work with this representation was beyond our ability at the time. We may be able to explore such an approach in the future again, however.


Reference String Based on Instance?
#5

I don’t think that would be much of a problem now anymore. We’ve significantly cleaned up the the way the GC stuff works on the codegen side and the union work has led to some better factoring of abstractions. Would be good to try again once 0.7 is out the door.


#6

That’s a great insight actually and in my previous posts I have been careful to mention those but just slip my mind this time. But there’s a reason and that is I have to come realise that datasets with repeated strings is the majority case; from my banking example the most common string columns that i need to sort are always the Customer ID or the Account ID which always has repeats. But I will detail the non repeat case for a comparison later. The other key case is skewwd distribution with repeats; think retailer, some 20% of cusomters are responsible for 80% of purchases, so in the transactions table there is a skewed distribution of IDs


#7

I have to think about how to measure this; I can’t think of a perfect way. I have seen that Julia can generate the test data 3x faster than R. But generating synthetic data is not exactly a real life use case.

Yes. For sort Julia is still 20%-30% slower than R from my testing. The sort chart was from 10m length vector with 100k unique values


#8

This will be a great piece of analysis! I will look into it.

Another great question. Interested to find out…


#9

That would be awesome!

I am writing an R package that can manipulate large amount of data on disk. I really wanted to do it in Julia (design is a different to JuliaDB.jl so can’t just use that), but I choose R mainly because data.table is so much faster on grouping/sorting/reading in data etc. Hopefully, DataFrames and JuliaDB can catch up soon! Of course string sorting/grouping is key.

Btw I think that Julia’s string hashing can also be faster.


#10

You could also use Base.crc32c instead of hash for a string hash table, it’s almost twice as fast, and unless you have a huge hash table, will probably give as good results as hash for spreading things out.


#11

Thanks to @ChrisRackauckas’s great suggestion, the best known Julia implementation for string sort actually beats R when the ratio of unique values to vector length is greater than 3:10 in this particular setup-10m id strings with common prefix with variouz number of unique strings

As can be seen Julia’s radix sort outperforms R’s when the number of unique values is large. For this particular case I have 10m-length vector and the cross over point is at 3m unique values.

It is also true that R takes significantly longer to generate the synthetic data, possibly due to the building of global cache for string interning; or it could just be naturally slower, but it’s hard to separate out the two effects.


#12

A faster sortperm for strings has been published as part of SortingLab.jl. The function is fsortperm

example code below

Pkg.add("SortingLab") # run once
sv = [randstring(8) for i=1:Int(1e6)];
svs = radixsort(sv) # faster than sort
svsp = fsortperm(sv) .|> Int # faster than sortperm
issorted(sv[svsp])

#13

Nice work. One thing I want to move other here is our Twitter conversation. I noted that small numbers of unique values is something that would show up a lot in general computing, like sorting filenames in a folder of a webserver, while lots of repeats are found in datasets. You suggested that DataFrames.jl or other data readers could use interned strings so that way the performance is specialized to both domains. I think that’s a really great idea and would be a great use of Julia’s type system since most/all algorithms should work no matter what string type is internally used, and that context switch is a great way to get performance. I would be interested in seeing that investigated in more depth.


#14

We kind of do this already since CSV.jl creates CategoricalArray vectors for columns with a small proportion of unique values. I think it makes sense to use CategoricalArray/PooledArray to represent values with lots of duplicates, and arrays of strings for “real” strings which are almost all different from one another.


#15

yep. R mostly don’t bother with storing strings as factors much anymore probably due to string interning. it would be good to have finer control over these type of things.


#16

I have implemented the short string representation at ShortStrings.jl, and yes it’s blazing fast! I hope that it will make it to Base soon, or at least DataFrames.jl should take note and implement this. It will lead to much faster sorting and group-by operations.

On a side note, I found that being able to define my own types in Julia that is also fast to be such an awesome feature! Definitely a selling point for Julia.


#17

Seems like this could be a pretty massive headache when trying to enable stack allocation / inline allocation of reference objects. It’s still probably worth investigating in the future though, just not sure about priority.

Also, now that we store string bytes inline always, doesn’t that get us pretty far towards making this like a fat-pointer (and we just need to get aforementioned stack-allocation)?