WIP: faster string sort

The [radix sort] implementation is orders of magnitude faster than shell sort for character vectors, in part thanks to clever use of the internal CHARSXP table.

That is from R’s documentation on sort. I thinking they are using it somehow.

These is still a gap but the difference looks small on 10m string. I think there is 20%-30% performance gap still.

Yeah, R uses radix sort as default for numbers and factors. I actually found the sorting algorithm in v0.6.2 to be very fast for integers and floats. Strings and sortperm is a bit slow.

One issue is that I think strings use a merge sort by default. Just trying quicksort might help a lot. That could easily be made the default. If a sufficiently generic radix sort for builtin types exists, we could make it the default as well. Not a top priority right now but a good post-1.0 project.

For the case of number of strings >> number of unique strings, how costly would it be to pass over the string vector, counting how many instances of each unique string, sort the unique strings, and then replicate each by their frequency of occurrence?

1 Like

this would require a hash table if you don’t assume prior knowledge. What you described is the classic counting sort. From my own testing, it was slower.

Radix sort uses an array to keep counting instead of a hash table. It recognizes that you can’t have a 2^64 array of to hold the counts so it does it in stages by holding 2^n array of counts where n < 64 and do multiple passes over the data.

2 Likes

A faster sort for strings has been published as part of SortingLab.jl. The function is radixsort

example code below

Pkg.add("SortingLab") # run once
sv = [randstring(8) for i=1:Int(1e6)];
svs = radixsort(sv) # faster than sort
svsp = fsortperm(sv) .|> Int # faster than sortperm
issorted(sv[svsp])

My PR to SortingAlgorithms.jl has hit a little bit of a snag, hence this new package - SortingLab.jl; but I do intend to use SortingLab.jl as more or less a playground for new ideas. Mature ideas that work I will try to contribute back to Base or SortingAlgorithms.

8 Likes

Nice!

how does it do with strings of lengths 16 and varying lengths 4…16?

Still twice as fast.

x = [randstring(rand(4:16)) for i = 1:1_000_000]

using SortingLab
radixsort(x[1:1]) # compile
sort(x[1:1])
@time radixsort(x) #0.4
@time sort(x) # 0.89

# fixed length 16
x = [randstring(16) for i = 1:1_000_000]
using SortingLab
radixsort(x[1:1]) # compile
sort(x[1:1])
@time radixsort(x)
@time sort(x)
@time sort(x, alg=QuickSort)

If the strings are just 15 bytes or shorter, then can use ShortStrings.jl (unregistered as of 8th Feb 2018; but is being registered) for even faster sorting.

using ShortStrings

x = [randstring(rand(4:15)) for i = 1:1_000_000]
sx =  ShortString15.(x)
ShortStrings.fsort(sx[1:1]) # compile
sort(x[1:1])
@time ShortStrings.fsort(sx) #0.23s
@time sort(sx) #1.27s

do you expect 8bit chars? the Char type is 32 bits … how is this handled?

Currently not handled. So ascii only. I should make it clear and I have
absolutely no expertise with Unicode and Char etc. So I made these purely
for the Fannie Mae dats and is not production ready.

off-the-cuff interconversion


chars(str::String) = map(x->Char(x[1]), split(str, ""))
ascii(chars::Vector{Char}) = map(Int8, chars)
ascii(str::String) = ascii(chars(str))

chars(ascii::Vector{Int8}) = map(Char, ascii)
string(chars::Vector{Char}) = join(chars, "")
string(ascii::Vector{Int8}) = string(chars(ascii))
julia> str = "string"
julia> ascii_codes = ascii(str)
6-element Array{Int8,1}:
 115
 116
 114
 105
 110
 103

julia> string(ascii_codes)
"string"

8bits is not sufficient condition. For example: CP1250 (central and eastern european encoding) is locale dependent and in hungarian alphabet ‘é’ (0xE9) is less than ‘z’ (0x7A)

@xiaodai maybe rename package/module to ShortAsciiStrings (or types to ShortAsciiString…) would be appropriate?

Fair enough

But those aren’t ASCII strings at all, as ASCII is purely 7-bit.

Short8BitStrings might be more appropriate, if you intend for it to continue to be limited to only 8-bit code units.

7 bit or 7 significant bit?

In other words - how do you want to encode 7bit chars in 64bit architecture?

To be pedantic, 7 signficant bits, yes, but I doubt you’d find many people who’d bother to bring up that distinction. (People old enough to recall when that the 8th bit was used as a check bit for old serial communications, maybe? Or people who were also familar with machines that had 9-bit “bytes” [and 18 and 36 bit architectures :grinning:])

ShortAnsiStrings does what is sought

Maybe off topic but this ( https://social.msdn.microsoft.com/Forums/sqlserver/en-US/816c72ca-b328-4a7c-80b1-c9255968caeb/why-are-there-so-many106-hungarian-collations?forum=transactsql ) shows me that problem with sorting is a little complicated than I thought.

I was thinking about O(N) transformation before sort and after sort to get proper sorting in locale alphabet. (or transformation in constructor and conversion to String)

But simple naive solution could not help for example if we want case insensitive sorting… (it is problem with ShortAnsiStrinsg too)

I hope that (because Julia is so flexible :heart:) there is possibility to create nice Short8bitStrings package supporting collate. (but not sure there is need for it in this moment)