WIP: faster string sort

JeffreySarnoff · February 7, 2018, 11:38pm

off-the-cuff interconversion


chars(str::String) = map(x->Char(x[1]), split(str, ""))
ascii(chars::Vector{Char}) = map(Int8, chars)
ascii(str::String) = ascii(chars(str))

chars(ascii::Vector{Int8}) = map(Char, ascii)
string(chars::Vector{Char}) = join(chars, "")
string(ascii::Vector{Int8}) = string(chars(ascii))

julia> str = "string"
julia> ascii_codes = ascii(str)
6-element Array{Int8,1}:
 115
 116
 114
 105
 110
 103

julia> string(ascii_codes)
"string"

Liso · February 8, 2018, 6:42am

8bits is not sufficient condition. For example: CP1250 (central and eastern european encoding) is locale dependent and in hungarian alphabet ‘é’ (0xE9) is less than ‘z’ (0x7A)

@xiaodai maybe rename package/module to ShortAsciiStrings (or types to ShortAsciiString…) would be appropriate?

xiaodai · February 8, 2018, 6:42am

Fair enough

ScottPJones · February 8, 2018, 6:53am

But those aren’t ASCII strings at all, as ASCII is purely 7-bit.

Short8BitStrings might be more appropriate, if you intend for it to continue to be limited to only 8-bit code units.

Liso · February 8, 2018, 8:04am

7 bit or 7 significant bit?

In other words - how do you want to encode 7bit chars in 64bit architecture?

ScottPJones · February 8, 2018, 8:08am

To be pedantic, 7 signficant bits, yes, but I doubt you’d find many people who’d bother to bring up that distinction. (People old enough to recall when that the 8th bit was used as a check bit for old serial communications, maybe? Or people who were also familar with machines that had 9-bit “bytes” [and 18 and 36 bit architectures ])

JeffreySarnoff · February 8, 2018, 7:25pm

ShortAnsiStrings does what is sought

Liso · February 8, 2018, 7:49pm

Maybe off topic but this ( Why are there so many(106) Hungarian collations? ) shows me that problem with sorting is a little complicated than I thought.

I was thinking about O(N) transformation before sort and after sort to get proper sorting in locale alphabet. (or transformation in constructor and conversion to String)

But simple naive solution could not help for example if we want case insensitive sorting… (it is problem with ShortAnsiStrinsg too)

I hope that (because Julia is so flexible ) there is possibility to create nice Short8bitStrings package supporting collate. (but not sure there is need for it in this moment)

ScottPJones · February 8, 2018, 7:58pm

Except that this isn’t just ANSI compatible strings - CP1250 is not ANSI compatible (Microsoft added their own set of characters in the area that is control characters for all of the ANSI/ISO-8859 8-bit character sets).

ScottPJones · February 8, 2018, 8:09pm

There are so many ins and outs with correctly dealing with sorting for presentation purposes
(sorting simply for storage in a B-tree or other such internal operations is much, much easier - sorting in a Unicode codepoint compatible ordering is generally fine, even sorting by the UTF-16 code units (which doesn’t match the Unicode codepoints for codepoints > 0xffff works fine for that).

I really don’t think that’s where it should be handled, quite honestly. Truly handling collation involves locale specific tables, doing decomposition/composition, multiple pass operations, and more. That doesn’t fit well with a type with a maximum number of bytes. That is the sort of optimization that really should be hidden in the implementation details of an overall string package (like “Strs.jl” )

Liso · February 8, 2018, 8:14pm

Do you still plan to do thorough benchmark comparison with standard library?

ScottPJones · February 8, 2018, 8:25pm

Of course - you can download and run them yourself right now if you want.

JeffreySarnoff · February 8, 2018, 8:55pm

bah … too literal

I think he encodes 8bit quantities even when they self-identify as 7bit quantities. and if I were using it, I may want to slip in some top-half of the old-timey codepage. ShortByteStrings

Topic		Replies	Views
Progress towards faster `sortperm` for Strings Data performance , sortperm	16	4643	January 29, 2018
[ANN] SortingLab.jl - fast sorting algorithms for strings and CategoricalArrays Package Announcements sort , sortperm	2	1111	January 15, 2019
Ironic observation about `sort` and `sortperm` speed for "small integers" vs R Performance sort , sortperm , r	32	4651	February 4, 2018
10x faster sortperm() Performance sortperm	10	2691	October 29, 2024
[ANN]: WIP Strs.jl package ready for alpha review and testing Community	15	3259	April 3, 2018

WIP: faster string sort

Related topics