I also think that the most typical use case is a PooledArray (some CSV readers, like TextParser, even load this kind of string data into a PooledArray by default). It would also be interesting to see the performance in that case (which no longer has anything to do with strings; it should use the efficient encoding that a PooledArray provides).
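The pooled-encoding idea can be sketched in Base Julia without PooledArrays.jl (the `pooled_sort` helper below is hypothetical, not the package's API): keep integer references into a pool of unique values, sort only the K uniques, and count references, so the string comparisons never scale with N.

```julia
# Hypothetical sketch of sorting a pooled vector: sort the K unique
# values once, then emit each value as many times as it occurs
# (a counting sort over the integer pool references).
function pooled_sort(v::Vector{String})
    pool = unique(v)                         # the K unique values
    code = Dict(s => i for (i, s) in enumerate(pool))
    refs = [code[s] for s in v]              # integer encoding of v
    perm = sortperm(pool)                    # sort only the K uniques
    counts = zeros(Int, length(pool))
    for r in refs
        counts[r] += 1                       # one pass to count refs
    end
    out = similar(v)
    k = 1
    for p in perm                            # emit pool values in order
        for _ in 1:counts[p]
            out[k] = pool[p]
            k += 1
        end
    end
    out
end

v = rand(["id" * string(i) for i in 1:5], 50)
@assert pooled_sort(v) == sort(v)
```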
Perhaps I was not clear: the question is, in what context would I want to sort N ≫ K values per se, of which only K are unique?
In the contexts I use sorting, optimizing this is solving the wrong kind of problem. E.g. if I want to sort a table with such strings as part of a lexicographic ordering, I would aim to make string comparisons fast, perhaps by interning them in some kind of structure which achieves fast comparison that sort can call. Above some threshold of N/K, my intuition is that programmer effort is better spent thinking about the context instead of low-level optimizations.
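The interning idea above can be sketched in plain Julia (the `rank_codes` name is made up for illustration): map each distinct string to its rank among the sorted uniques, then sort or sortperm on the cheap integer codes instead of comparing strings repeatedly.

```julia
# Hypothetical sketch of interning: assign each distinct string the
# rank it holds among the sorted uniques, so comparing two codes is
# equivalent to comparing the original strings.
function rank_codes(v::Vector{String})
    uniq = sort!(unique(v))                  # sort only the K uniques
    rank = Dict(s => i for (i, s) in enumerate(uniq))
    [rank[s] for s in v]                     # cheap integer codes
end

v = rand(["id" * lpad(i, 3, '0') for i in 1:100], 10_000)
codes = rank_codes(v)
@assert v[sortperm(codes)] == sort(v)        # integer sort recovers string order
```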
As an example, consider
using SortingAlgorithms
using DataStructures
idstrvec = rand("id".*dec.(1:1_000_000,10), 100_000_000)
sort(idstrvec[1:10]) # compile
@time sort(idstrvec)
# 80.119106 seconds (88 allocations: 1.118 GiB, 5.72% gc time)
"Return `key=>count` pairs for elements in the argument sorted by the `key`."
function sorted_counts(v)
c = counter(v)
sort(collect(c.map), by = first)
end
sorted_counts(idstrvec[1:10]) # compile
@time sorted_counts(idstrvec)
# 49.105053 seconds (1.00 M allocations: 114.762 MiB, 0.35% gc time)
where a few minutes of effort with standard tools yields a 2x improvement, just by compressing the vector.
Think about sortperm, which relies on a fast underlying sorting algorithm. You have two vectors of the same length, e.g. two columns in a dataframe: one holds strings, e.g. the member_id, and the second the amount of money spent by the member in one transaction. The member may have spent money multiple times, so the strings can repeat. Say you wanted to compute the median, q25, and q75 spend of every member, without using an approximation. You can sortperm on the member_id, apply the returned indices to the transactions, and then compute the stats. Very common stuff in data manipulation. There are better algorithms for median, q25, and q75 that don't need sorting, so maybe not a perfect example, but once you sort the values you can apply an arbitrary function to all values belonging to the same group.
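The grouping pattern described above can be sketched like this (`median_by_group` is a hypothetical helper, assuming only the Statistics stdlib): one sortperm drives the grouping, then contiguous runs of equal ids are scanned.

```julia
using Statistics  # for median

# Sketch of the pattern described above: sortperm on member_id, apply
# the permutation to the amounts, then scan contiguous runs of equal ids.
function median_by_group(ids::Vector{String}, amounts::Vector{Float64})
    p = sortperm(ids)                 # one sort drives the whole grouping
    ids_s, amt_s = ids[p], amounts[p]
    result = Dict{String, Float64}()
    lo = 1
    for hi in 1:length(ids_s)
        # a run ends at the last element, or when the next id differs
        if hi == length(ids_s) || ids_s[hi + 1] != ids_s[hi]
            result[ids_s[lo]] = median(view(amt_s, lo:hi))
            lo = hi + 1
        end
    end
    result
end

ids = ["a", "b", "a", "b", "a"]
amounts = [1.0, 10.0, 2.0, 20.0, 3.0]
m = median_by_group(ids, amounts)
@assert m["a"] == 2.0 && m["b"] == 15.0
```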
Anyway, the post is about an implementation of string radix sort in Julia, which wasn't available, and as shown it leads to faster sorting. Sorting is generally useful, with repeated values or not.
There are more specialised sorting algorithms (counting sort) that can apply to CategoricalArrays and PooledArrays; those will be useful as well when implemented.
Before going to sleep I tried these two ideas, expecting nothing but…
String sorting performance vs R is only marginally slower now (within 1.2x–1.3x)! In my benchmarks, sorting 10m records of which 100k are distinct yields 1.31 seconds (Julia) vs 1.1 seconds (R); for 100m it's around 25–27 seconds vs 18–21 seconds.
I made these 3 changes:
- Use my own sort function instead of `Base.sort`, which is hard to overload as `Base.sort` doesn't dispatch on `alg`, or more generally on named arguments. Somehow `Base.sort(svec, alg = StringRadixSort)` was running some code which probably wasn't all that necessary.
- Change `SortingAlgorithms.RADIXSIZE` from 11 to 16, i.e. sort 16 bits in one go instead of 11. After getting comfortable with radix sort, I was curious why SortingAlgorithms.jl chose 11 as the `RADIXSIZE`; the most common types are probably multiples of 16 bits, e.g. UInt32 is 32 bits and UInt64 is 64 bits. Sorting 11 bits in one go, we need 3 passes over the data to sort a 32-bit type and 6 passes for a 64-bit type, vs 2 and 4 passes if we just choose a `RADIXSIZE` of 16 bits. From benchmarking, increasing the `RADIXSIZE` results in faster sorting too.
- Do not load the underlying bits into `UInt128`; instead use `UInt64` as the maximum. I found that it's quicker to sort twice using `UInt64` than to sort once using `UInt128`. When I say sort "once", the data has already been gone through 4 times when sorting the underlying bytes as 64-bit, so really the number of passes over the data is the same whether you use `UInt128` or `UInt64`, but it seems Julia can handle 64-bit data better, which makes sense given that I am running on a 64-bit machine.
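For illustration, an LSD radix sort over UInt64 keys with a 16-bit radix looks roughly like the sketch below (this is not the actual PR code): 4 passes over the data instead of the 6 that an 11-bit radix needs, at the cost of a 2^16-entry count array per pass.

```julia
# LSD radix sort on UInt64 keys, 16 bits per pass => 4 passes total,
# versus ceil(64 / 11) = 6 passes with a RADIXSIZE of 11.
function radix_sort16(v::Vector{UInt64})
    a, b = copy(v), similar(v)
    mask = UInt64(0xffff)
    for shift in (0, 16, 32, 48)
        counts = zeros(Int, 65536)            # 2^16 buckets
        for x in a
            counts[((x >> shift) & mask) + 1] += 1
        end
        cumsum!(counts, counts)               # prefix sums = end positions
        for i in length(a):-1:1               # backward pass keeps it stable
            d = ((a[i] >> shift) & mask) + 1
            b[counts[d]] = a[i]
            counts[d] -= 1
        end
        a, b = b, a                           # output becomes next input
    end
    a
end

v = rand(UInt64, 10_000)
@assert radix_sort16(v) == sort(v)
```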
Wow! So you matched data.table's performance without relying on string interning? I guess that means they actually don't use it at all?
It would be interesting to investigate whether something should be improved in Base.sort.
There seem to be very old issues about sort performance:
https://github.com/JuliaLang/julia/issues/939
Maybe this radixsort
could be of use more generally?
I think you'd need to do something to get it to use the SSE2 or AVX registers for 128 bits to be faster.
I was thinking of using UInt128 for some of my new string code, but the generated code used multiple 64-bit registers instead of one SSE2 register; I only looked at a few functions, though.
The [radix sort] implementation is orders of magnitude faster than shell sort for character vectors, in part thanks to clever use of the internal CHARSXP table.
That is from R's documentation on sort. I think they are using it somehow.
There is still a gap, but the difference looks small on 10m strings. I think there is still a 20%–30% performance gap.
Yeah, R uses radix sort as the default for numbers and factors. I actually found the sorting algorithm in v0.6.2 to be very fast for integers and floats. Strings and sortperm are a bit slow.
One issue is that I think strings use a merge sort by default. Just trying quicksort might help a lot. That could easily be made the default. If a sufficiently generic radix sort for builtin types exists, we could make it the default as well. Not a top priority right now but a good post-1.0 project.
For the case of number of strings ≫ number of unique strings, how costly would it be to pass over the string vector counting the instances of each unique string, sort the unique strings, and then replicate each by its frequency of occurrence?
This would require a hash table if you don't assume prior knowledge. What you described is the classic counting sort. From my own testing, it was slower.
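The approach described in the question, counting with a hash table, sorting the K uniques, and replicating each by its count, can be sketched as follows (`counting_sort` is an illustrative name, not a library function):

```julia
# Classic counting sort via a hash table: count each unique value,
# sort the K uniques, then replicate each by its count.
function counting_sort(v::Vector{String})
    counts = Dict{String, Int}()
    for x in v
        counts[x] = get(counts, x, 0) + 1    # one counting pass over N
    end
    out = String[]
    sizehint!(out, length(v))
    for key in sort!(collect(keys(counts)))  # sort only the K uniques
        append!(out, fill(key, counts[key])) # replicate by frequency
    end
    out
end

v = rand(["a", "bb", "ccc", "d"], 100)
@assert counting_sort(v) == sort(v)
```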
Radix sort uses an array to keep the counts instead of a hash table. It recognizes that you can't have a 2^64-element array to hold the counts, so it works in stages, holding a 2^n array of counts where n < 64 and making multiple passes over the data.
A faster sort for strings has been published as part of SortingLab.jl. The function is radixsort; example code below:
Pkg.add("SortingLab") # run once
sv = [randstring(8) for i=1:Int(1e6)];
svs = radixsort(sv) # faster than sort
svsp = fsortperm(sv) .|> Int # faster than sortperm
issorted(sv[svsp])
My PR to SortingAlgorithms.jl has hit a bit of a snag, hence this new package, SortingLab.jl; but I do intend to use SortingLab.jl as more or less a playground for new ideas. Mature ideas that work, I will try to contribute back to Base or SortingAlgorithms.jl.
Nice!
How does it do with strings of length 16, and with varying lengths 4–16?
Still twice as fast.
x = [randstring(rand(4:16)) for i = 1:1_000_000]
using SortingLab
radixsort(x[1:1]) # compile
sort(x[1:1])
@time radixsort(x) #0.4
@time sort(x) # 0.89
# fixed length 16
x = [randstring(16) for i = 1:1_000_000]
using SortingLab
radixsort(x[1:1]) # compile
sort(x[1:1])
@time radixsort(x)
@time sort(x)
@time sort(x, alg=QuickSort)
If the strings are just 15 bytes or shorter, then you can use ShortStrings.jl (unregistered as of 8th Feb 2018, but being registered) for even faster sorting.
using ShortStrings
x = [randstring(rand(4:15)) for i = 1:1_000_000]
sx = ShortString15.(x)
ShortStrings.fsort(sx[1:1]) # compile
sort(sx[1:1]) # compile
@time ShortStrings.fsort(sx) #0.23s
@time sort(sx) #1.27s
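The trick behind ShortStrings can be sketched in Base Julia (the `pack` helper here is hypothetical, not the package's API): left-align the ASCII bytes of a short string in a UInt64, so unsigned integer comparison matches lexicographic string comparison.

```julia
# Pack an ASCII string of up to 8 bytes into a UInt64, left-aligned and
# zero-padded, so unsigned integer order matches lexicographic order
# (zero padding makes a string sort before any string it is a prefix of).
function pack(s::AbstractString)
    n = ncodeunits(s)
    @assert n <= 8 && isascii(s)
    u = UInt64(0)
    for b in codeunits(s)
        u = (u << 8) | b
    end
    u << (8 * (8 - n))   # left-align the bytes
end

xs = ["abc", "ab", "abd", "b", "aa", ""]
@assert xs[sortperm(pack.(xs))] == sort(xs)
```

ShortStrings.jl uses a larger integer type to fit up to 15 bytes plus the length, but the ordering idea is the same, which is why a fast integer radix sort then applies directly.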
Do you expect 8-bit chars? The Char type is 32 bits… how is this handled?
Currently not handled, so ASCII only. I should make that clear. I have absolutely no expertise with Unicode, Char, etc., so I made these purely for the Fannie Mae data, and it is not production-ready.