To be pedantic, 7 significant bits, yes, but I doubt you'd find many people who'd bother to bring up that distinction. (People old enough to recall when the 8th bit was used as a check bit for old serial communications, maybe? Or people who were also familiar with machines that had 9-bit "bytes" [and 18- and 36-bit architectures].)
I was thinking about an O(N) transformation before the sort and another after it to get proper sorting in the locale's alphabet (or a transformation in the constructor and in the conversion to String).
But a simple, naive solution would not help if, for example, we want case-insensitive sorting… (that is a problem with ShortAnsiStrings too)
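To make the idea concrete, here is a minimal sketch of the "remap, sort, remap back" approach. It is written in Python only so it is easy to run; it is not the API of any existing package. The names `build_tables` and `locale_sort` and the toy Czech-like alphabet are my own assumptions for illustration.

```python
# Sketch only: a per-byte rank table for a single-byte code page.
# `build_tables` and `locale_sort` are hypothetical helper names.

def build_tables(alphabet, encoding="cp1250"):
    """Return (to_rank, from_rank) 256-byte translation tables.

    Letters in `alphabet` get consecutive ranks in the order given;
    every other byte keeps its relative position around them.
    """
    listed = alphabet.encode(encoding)
    low = min(listed)
    before = [b for b in range(low) if b not in listed]
    after = [b for b in range(low, 256) if b not in listed]
    order = bytes(before) + listed + bytes(after)  # order[rank] == original byte
    to_rank = bytes(order.index(b) for b in range(256))
    return to_rank, order


def locale_sort(words, to_rank, from_rank, encoding="cp1250"):
    # O(N) pass before the sort: replace every byte by its rank.
    keys = [w.encode(encoding).translate(to_rank) for w in words]
    keys.sort()  # plain bytewise comparison on the remapped keys
    # O(N) pass after the sort: map ranks back to the original bytes.
    return [k.translate(from_rank).decode(encoding) for k in keys]


# Toy lowercase alphabet (an assumption, not a full Czech collation).
TO_RANK, FROM_RANK = build_tables("abcčdefghijklmnopqrsštuvwxyzž")

print(locale_sort(["dub", "čaj", "cukr"], TO_RANK, FROM_RANK))
# ['cukr', 'čaj', 'dub']
print(locale_sort(["apple", "Zebra"], TO_RANK, FROM_RANK))
# ['Zebra', 'apple']  (uppercase still sorts before all lowercase)
```

The second call shows the limitation: a one-to-one byte remap cannot give "Z" and "z" the same primary weight, so case-insensitive ordering needs more than a single rank per byte.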
I hope that (because Julia is so flexible) it is possible to create a nice Short8bitStrings package supporting collation (but I am not sure there is a need for it at the moment).
Except that this isn't just about ANSI-compatible strings: CP1250 is not ANSI compatible (Microsoft added its own set of characters in the range that holds control characters in all of the ANSI/ISO-8859 8-bit character sets).
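A quick way to see the difference is to decode the bytes 0x80..0x9F under both encodings. This is a small sketch using Python's built-in codecs; the helper name `c1_range_categories` is made up for the example.

```python
import unicodedata


def c1_range_categories(encoding):
    """Map each byte 0x80..0x9F to the Unicode general category it decodes to."""
    out = {}
    for b in range(0x80, 0xA0):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            out[b] = None  # byte is unassigned in this code page
            continue
        out[b] = unicodedata.category(ch)
    return out


iso = c1_range_categories("iso8859_2")
cp = c1_range_categories("cp1250")

print(set(iso.values()))                 # {'Cc'}: control characters only
print(bytes([0x8A]).decode("cp1250"))    # Š
print(cp[0x8A])                          # Lu: an uppercase letter, not a control
```

So the same byte that is a C1 control character in ISO-8859-2 is a printable letter in CP1250, which is why text in the two encodings cannot be treated interchangeably.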
There are so many ins and outs to dealing correctly with sorting for presentation purposes
(sorting simply for storage in a B-tree or other such internal operations is much, much easier: sorting in a Unicode-codepoint-compatible ordering is generally fine, and even sorting by the UTF-16 code units (which doesn't match the Unicode codepoint order for codepoints > 0xffff) works fine for that).
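For anyone who has not run into the UTF-16 ordering quirk, here is a small Python sketch (the function names are just for illustration). Characters above 0xffff are encoded as surrogate pairs starting at 0xD800, so they compare lower than BMP characters in the range 0xE000..0xFFFF when you compare code units.

```python
def by_codepoint(strings):
    # Python compares str values code point by code point.
    return sorted(strings)


def by_utf16_units(strings):
    # Big-endian UTF-16 bytes compare in the same order as the 16-bit code units.
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))


bmp = "\uff5e"         # U+FF5E, one code unit: FF5E
astral = "\U0001F600"  # U+1F600, surrogate pair: D83D DE00

print([hex(ord(s)) for s in by_codepoint([astral, bmp])])
# ['0xff5e', '0x1f600']
print([hex(ord(s)) for s in by_utf16_units([astral, bmp])])
# ['0x1f600', '0xff5e']
```

Both orderings are consistent total orders, which is all a B-tree or similar internal structure needs; they just disagree with each other for codepoints above 0xffff.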
I really don't think that's where it should be handled, quite honestly. Truly handling collation involves locale-specific tables, doing decomposition/composition, multiple-pass operations, and more. That doesn't fit well with a type with a maximum number of bytes. That is the sort of optimization that really should be hidden in the implementation details of an overall string package (like "Strs.jl").
I think he encodes 8-bit quantities even when they self-identify as 7-bit quantities. And if I were using it, I might want to slip in some of the top half of an old-timey code page. ShortByteStrings