In my experience over many years of supporting applications that did heavy string processing, dealing with all sorts of pre-Unicode single-byte and multibyte representations, Unicode 1.0 UCS-2, and the various Unicode encoding variants since, I haven't seen cases where people really wanted to keep invalid data in their applications. What they wanted was:

1) use of a replacement character (possibly of their choice),
2) exceptions that return enough information to figure out how to proceed if desired, or
3) user handling of invalid sequences (to allow more complex behavior than a simple replacement character or exception),

as well as being able to optionally accept on input things like overlong UTF-8 sequences (frequently used to represent a NUL byte as 0xC0 0x80, as Java does), or characters in the 0x10000-0x10FFFF range represented as two surrogate characters encoded as 3 bytes each, instead of a single 4-byte sequence.
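To make those three strategies concrete, here is a small sketch in Python (chosen only because its codec machinery exposes all three directly; the byte string and handler name are my own illustrative choices). It shows replacement, an informative exception, and a custom handler that accepts Java-style "modified UTF-8", where NUL is encoded as the overlong pair 0xC0 0x80:

```python
import codecs

# Java-style modified UTF-8: NUL encoded as the overlong pair 0xC0 0x80
data = b"abc\xc0\x80def"

# 1) Replacement character: each invalid byte becomes U+FFFD
print(data.decode("utf-8", errors="replace"))

# 2) Exception with enough information to decide how to proceed:
#    the error object carries the offending range and a reason
try:
    data.decode("utf-8")  # strict decoding is the default
except UnicodeDecodeError as e:
    print(e.start, e.end, e.reason)

# 3) User handling: a custom error handler that decodes the
#    overlong 0xC0 0x80 sequence as an actual NUL character
#    and re-raises on anything else
def modified_utf8(err):
    if err.object[err.start:err.start + 2] == b"\xc0\x80":
        return ("\x00", err.start + 2)  # resume after the two bytes
    raise err

codecs.register_error("modified-utf8", modified_utf8)
decoded = data.decode("utf-8", errors="modified-utf8")
print(repr(decoded))
```

The same three-way split (replace, raise with context, or call back into user code) is what an invalid-Unicode-tolerant string type would need to offer at the API level.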
That’s exactly the kind of thing that allowing for invalid Unicode in strings will make possible.
I don’t see why we wouldn’t be able to use AbstractChar
if it turns out that the standard Char
has a noticeable performance impact for some particular use cases.