Unicode 15.0 (beta) and sorting/collation

A.
Unicode 15 is in beta, just as Julia 1.8-rc1 is catching up to 14.0. Few, if any, programming languages already support 14.0.0; Python 3.11.0b3 does, but it’s unclear whether that has been backported to current Python releases. Julia could be the first to support 14 or 15.
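For the Python side, a quick way to see which Unicode version a given interpreter ships is the standard library's `unicodedata.unidata_version`. This is only a small check; the variable names `major` and `supports_14` are mine, for illustration:

```python
import sys
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
# The value depends on the Python build you run this on.
print(sys.version.split()[0], "-> Unicode", unicodedata.unidata_version)

# Hypothetical helper values: does this build know Unicode 14 or later?
major = int(unicodedata.unidata_version.split(".")[0])
supports_14 = major >= 14
print("Unicode 14+ character data available:", supports_14)
```

Running it under each installed Python version answers the backport question for that version directly.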

The comment period ends July 12, 2022.

From what I see so far, Julia seems to be in good shape. I think the security issue is known and fixed, unless that isn’t the one I had in mind:

  • There are a series of updates to UAX #31, Unicode Identifier and Pattern Syntax and to UTS #39, Unicode Security Mechanisms, to clarify issues regarding identifiers in programming languages, particularly in bidirectional contexts, as well as the use of ZWJ and ZWNJ in identifiers. A coordinated example was also added to UAX #9, Unicode Bidirectional Algorithm, to illustrate […]
  • In UAX #38, Unicode Han Database (Unihan) […] UAX #38 also has updated regex values for numerous Unihan properties.
  • In UAX #45
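To make the bidirectional and ZWJ/ZWNJ concern concrete, here is a naive scan for explicit bidi formatting characters and joiners in a piece of source text. It is only a sketch, not what UAX #31 or UTS #39 actually specify: the function name `find_suspicious` and the decision to flag every occurrence are my assumptions (joiners are legitimate in some scripts).

```python
# Explicit bidirectional formatting characters:
# embeddings/overrides (U+202A..U+202E) and isolates (U+2066..U+2069).
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",
    "\u2066", "\u2067", "\u2068", "\u2069",
}
JOINERS = {"\u200C", "\u200D"}  # ZWNJ, ZWJ


def find_suspicious(source):
    """Return (index, 'U+XXXX', kind) for every bidi control or joiner found."""
    hits = []
    for i, ch in enumerate(source):
        if ch in BIDI_CONTROLS:
            hits.append((i, f"U+{ord(ch):04X}", "bidi control"))
        elif ch in JOINERS:
            hits.append((i, f"U+{ord(ch):04X}", "joiner"))
    return hits


# Hypothetical sample line: an override after an identifier, a ZWJ in a comment.
sample = "access\u202E = 1  # x\u200Dy"
hits = find_suspicious(sample)
for index, code, kind in hits:
    print(index, code, kind)
```

A real check would follow the profiles in those documents instead of flagging everything.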

Collation-related Issues

The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 15.0.0 repertoire for UCA 15.0. For the most part, the additions for new scripts and other characters are unremarkable, but implementations should be checked to ensure the new additions do not cause problems.

  • 20 new emoji characters have been added. However, […] If your implementation supports emoji, be sure to carefully review UTS #51, Unicode Emoji (PRI #454).
  • WARNING: There is a change to the end of one existing CJK unified ideograph range in Unicode 15.0.0.

@Lilith I intend to make a string type, the fastest so far with or without sorting, that better supports my native Icelandic, English, and many (but not all) European languages. ASCIIbetical or code point order isn’t good, even as a default for English; better is AaBbCc etc., or for Icelandic AaÁáBb etc. The latter would work for English too, since it’s a superset alphabet, and shouldn’t trouble English speakers unless they insist on e.g. Áá after the English alphabet… or want e.g. AÁaá?
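A minimal sketch of the AaÁáBb idea, in Python for illustration (not the planned Julia string type). Assumptions, all mine: input is in NFC (precomposed letters), the non-Icelandic letters C, Q, W are slotted in at their English positions and Z after Ý, and anything outside the table falls back to code point order after all listed letters. The names `ALPHABET`, `RANK` and `sort_key` are hypothetical.

```python
# Icelandic alphabet plus C, Q, W, Z so that English is covered too.
ALPHABET = "AÁBCDÐEÉFGHIÍJKLMNOÓPQRSTUÚVWXYÝZÞÆÖ"

# Interleave upper and lower case: A a Á á B b ...
ORDER = "".join(up + up.lower() for up in ALPHABET)
RANK = {ch: i for i, ch in enumerate(ORDER)}


def sort_key(s):
    # Known letters sort by table rank; everything else sorts after them,
    # in plain code point order.
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in s]


words = ["zebra", "Árni", "apple", "Zebra", "Anna", "ár", "Þór", "banana", "Banana"]

print(sorted(words))
# ['Anna', 'Banana', 'Zebra', 'apple', 'banana', 'zebra', 'Árni', 'Þór', 'ár']

print(sorted(words, key=sort_key))
# ['Anna', 'apple', 'Árni', 'ár', 'Banana', 'banana', 'Zebra', 'zebra', 'Þór']
```

Note this is a single-level, character-by-character order, so "Ac" sorts before "ab" because A < a is decided at the first position. The full Unicode Collation Algorithm compares base letters across the whole string first and only then looks at case and accents.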

I do not intend to support the full:

Unicode Collation Algorithm

Does anyone know of such a library (in ICU, or elsewhere and better) that could be wrapped for Julia (or has that already been done)?

At least there are no additions to it, but there are many surprises already there (every time I read the document). I scanned the draft document, then saw it confirmed at the bottom (only): "Minor textual edits and updates for Version 15.0 references."

https://www.unicode.org/reports/tr10/tr10-46.html

3.8.1 Backward Accents
In some French dictionary ordering traditions, accents are sorted from the back of the string to the front of the string.
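A toy illustration of what "from the back" means, assuming a simplified two-level key: base letters first, then accents, with the accent level optionally reversed. This is not the real algorithm (no DUCET weights, and the accents compare by the code point of their combining mark); the function name `two_level_key` is hypothetical.

```python
import unicodedata


def two_level_key(word, backward):
    decomposed = unicodedata.normalize("NFD", word)
    # Primary level: base letters only (combining marks dropped).
    primary = [c for c in decomposed if not unicodedata.combining(c)]
    # Secondary level: one entry per base letter, holding its combining marks.
    secondary = []
    for c in decomposed:
        if unicodedata.combining(c) and secondary:
            secondary[-1] += c
        else:
            secondary.append("")
    if backward:
        secondary.reverse()  # compare accents from the end of the string
    return (primary, secondary)


words = ["côté", "cote", "coté", "côte"]

forward = sorted(words, key=lambda w: two_level_key(w, backward=False))
french = sorted(words, key=lambda w: two_level_key(w, backward=True))

print(forward)  # ['cote', 'coté', 'côte', 'côté']
print(french)   # ['cote', 'côte', 'coté', 'côté']
```

With forward accents the first differing accent decides; with backward accents the last one does, which is why côte moves ahead of coté.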

Simple Method

The specification of the Unicode Collation Algorithm requires that Hangul syllables be decomposed. However,

A deterministic comparison is different than either a stable sort or a deterministic sort; it is a property of a comparison function, not a sort algorithm. […]
A deterministic comparison is generally not good practice.
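To see the distinction, here is a small sketch where `str.casefold` stands in for a real collation key (my assumption, purely for illustration). A stable sort keeps equal-key strings in input order, so the result depends on the input; a deterministic comparison adds a tie-breaker so that no two different strings ever compare equal.

```python
def collation_key(s):
    # Stand-in for a real collation key: treats case variants as equal.
    return s.casefold()


def deterministic_key(s):
    # Same key, plus the raw string as a code point tie-breaker.
    return (collation_key(s), s)


first = ["ab", "AB", "Ab"]
second = ["Ab", "ab", "AB"]

# Stable sort with the plain key: ties stay in input order.
print(sorted(first, key=collation_key))   # ['ab', 'AB', 'Ab']
print(sorted(second, key=collation_key))  # ['Ab', 'ab', 'AB']

# Deterministic comparison: same result regardless of input order.
print(sorted(first, key=deterministic_key))   # ['AB', 'Ab', 'ab']
print(sorted(second, key=deterministic_key))  # ['AB', 'Ab', 'ab']
```

The tie-breaker makes the key longer and the comparison slower.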

B.
In related news Python 3.10.5 is out with:

The PEP was scheduled for 3.11; I suppose they changed that early or backported it? Does anyone know?

@cjdoris Is any of the above or below relevant to PythonCall.jl?

Python 3.11 is in beta (3.11.0b3):

Python 3.11 is up to 10-60% faster than Python 3.10. On average, we measured a 1.25x speedup on the standard benchmark suite. See Faster CPython for details.

Has this already been backported? (I don’t see it; where would one look?):

https://bugs.python.org/issue45190

C.
Not much new in 15.0, but the ship has long sailed on fixed-length Unicode in practice (despite e.g. UTF-32 or UCS-2).

[I really don’t know how, or want, to support emojis for sorting; I guess, though, it’s simply done in code point order.]

https://www.unicode.org/reports/tr51/tr51-22.html#multiperson_skintones

RGI ZWJ sequences were updated to add 25 skin tone combinations for woman and man holding hands, and 15 combinations each for women holding hands, men holding hands, and people holding hands. These sequences appear as 70 different images.
[…]
Emoji 12.1 addition: 1F468 1F3FB 200D 1F91D 200D 1F468 1F3FD ; men holding hands: light skin tone, medium skin tone

The only difference between the above sequences is that the inferred positions of the medium-skin-tone man and the light-skin-tone man are swapped, left and right.

Implementations can use the same image for both sequences.
[…]
However, in Emoji 15.0, such emoji modifier sequences only have RGI status for six of the nine characters: kiss, couple with heart, woman and man holding hands, men holding hands, women holding hands, and handshake.
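The Emoji 12.1 sequence quoted above is a handy illustration of why "fixed-length Unicode" doesn't hold in practice, even in UTF-32. A small sketch that builds it from the listed code points and counts its size in each encoding (variable names are mine):

```python
# Code points exactly as listed in the quoted Emoji 12.1 entry.
code_points = "1F468 1F3FB 200D 1F91D 200D 1F468 1F3FD".split()
seq = "".join(chr(int(cp, 16)) for cp in code_points)

n_code_points = len(seq)                      # also the UTF-32 code unit count
n_utf8_bytes = len(seq.encode("utf-8"))
n_utf16_units = len(seq.encode("utf-16-le")) // 2
n_utf32_bytes = len(seq.encode("utf-32-le"))

print(n_code_points)   # 7
print(n_utf8_bytes)    # 26
print(n_utf16_units)   # 12
print(n_utf32_bytes)   # 28
```

One displayed image, seven code points: a UTF-32 string still needs variable-length logic to treat this as a single unit.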


FYI: I just found there’s a defined emoji ordering (and it’s not, simply, code-point order):

https://unicode.org/emoji/charts-15.0/emoji-ordering.html

That’s one more nail in the coffin for code point order being useful. In my localized sorting code (for some European languages) in my new string type, I don’t plan to support emoji ordering (let alone with modifiers); I’ll keep emoji, and the non-Latin alphabets I don’t care about, in code point order.