WIP: faster string sort

off-the-cuff interconversion


chars(str::String) = map(x->Char(x[1]), split(str, ""))
ascii(chars::Vector{Char}) = map(Int8, chars)
ascii(str::String) = ascii(chars(str))

chars(ascii::Vector{Int8}) = map(Char, ascii)
string(chars::Vector{Char}) = join(chars, "")
string(ascii::Vector{Int8}) = string(chars(ascii))
julia> str = "string"
julia> ascii_codes = ascii(str)
6-element Array{Int8,1}:
 115
 116
 114
 105
 110
 103

julia> string(ascii_codes)
"string"

8bits is not sufficient condition. For example: CP1250 (central and eastern european encoding) is locale dependent and in hungarian alphabet â€˜Ă©â€™ (0xE9) is less than ‘z’ (0x7A)

@xiaodai maybe rename package/module to ShortAsciiStrings (or types to ShortAsciiString
) would be appropriate?

Fair enough

But those aren’t ASCII strings at all, as ASCII is purely 7-bit.

Short8BitStrings might be more appropriate, if you intend for it to continue to be limited to only 8-bit code units.

7 bit or 7 significant bit?

In other words - how do you want to encode 7bit chars in 64bit architecture?

To be pedantic, 7 signficant bits, yes, but I doubt you’d find many people who’d bother to bring up that distinction. (People old enough to recall when that the 8th bit was used as a check bit for old serial communications, maybe? Or people who were also familar with machines that had 9-bit “bytes” [and 18 and 36 bit architectures :grinning:])

ShortAnsiStrings does what is sought

Maybe off topic but this ( Why are there so many(106) Hungarian collations? ) shows me that problem with sorting is a little complicated than I thought.

I was thinking about O(N) transformation before sort and after sort to get proper sorting in locale alphabet. (or transformation in constructor and conversion to String)

But simple naive solution could not help for example if we want case insensitive sorting
 (it is problem with ShortAnsiStrinsg too)

I hope that (because Julia is so flexible :heart:) there is possibility to create nice Short8bitStrings package supporting collate. (but not sure there is need for it in this moment)

Except that this isn’t just ANSI compatible strings - CP1250 is not ANSI compatible (Microsoft added their own set of characters in the area that is control characters for all of the ANSI/ISO-8859 8-bit character sets).

There are so many ins and outs with correctly dealing with sorting for presentation purposes
(sorting simply for storage in a B-tree or other such internal operations is much, much easier - sorting in a Unicode codepoint compatible ordering is generally fine, even sorting by the UTF-16 code units (which doesn’t match the Unicode codepoints for codepoints > 0xffff works fine for that).

I really don’t think that’s where it should be handled, quite honestly. Truly handling collation involves locale specific tables, doing decomposition/composition, multiple pass operations, and more. That doesn’t fit well with a type with a maximum number of bytes. That is the sort of optimization that really should be hidden in the implementation details of an overall string package (like “Strs.jl” :wink: )

1 Like

Do you still plan to do thorough benchmark comparison with standard library?

Of course - you can download and run them yourself right now if you want.

bah 
 too literal :slight_smile:

I think he encodes 8bit quantities even when they self-identify as 7bit quantities. and if I were using it, I may want to slip in some top-half of the old-timey codepage. ShortByteStrings

2 Likes