In my experience over many years of supporting applications that did heavy string processing, dealing with all sorts of pre-Unicode single-byte and multibyte representations, Unicode 1.0 UCS-2, and the various Unicode encoding variants since, I haven't seen cases where people really wanted to keep invalid data in their applications. What they wanted was:

1) use of a replacement character (possibly of their choice),
2) exceptions that return enough information to figure out how to proceed if desired, or
3) user handling of invalid sequences (to allow more complex behavior than a simple replacement character or exception),

as well as being able to optionally accept on input things like overlong UTF-8 sequences (frequently used to represent a NUL byte as 0xC0 0x80, as Java does), or characters in the 0x10000-0x10FFFF range represented as two surrogate characters encoded as 3 bytes each, instead of a single 4-byte sequence.
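To make those three strategies concrete, here is a small sketch in Python (chosen only because its codec machinery exposes all three directly; the byte string and handler name are my own illustrative choices). It shows replacement, an informative exception, and a custom handler that accepts Java-style "modified UTF-8", where NUL is encoded as the overlong pair 0xC0 0x80:

```python
import codecs

# Java-style modified UTF-8: NUL encoded as the overlong pair 0xC0 0x80
data = b"abc\xc0\x80def"

# 1) Replacement character: each invalid byte becomes U+FFFD
print(data.decode("utf-8", errors="replace"))

# 2) Exception with enough information to decide how to proceed:
#    the error object carries the offending range and a reason
try:
    data.decode("utf-8")  # strict decoding is the default
except UnicodeDecodeError as e:
    print(e.start, e.end, e.reason)

# 3) User handling: a custom error handler that decodes the
#    overlong 0xC0 0x80 sequence as an actual NUL character
#    and re-raises on anything else
def modified_utf8(err):
    if err.object[err.start:err.start + 2] == b"\xc0\x80":
        return ("\x00", err.start + 2)  # resume after the two bytes
    raise err

codecs.register_error("modified-utf8", modified_utf8)
decoded = data.decode("utf-8", errors="modified-utf8")
print(repr(decoded))
```

The same three-way split (replace, raise with context, or call back into user code) is what an invalid-Unicode-tolerant string type would need to offer at the API level.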
That’s exactly the kind of thing that allowing for invalid Unicode in strings will make possible.
I don’t see why we wouldn’t be able to use AbstractChar
if it turns out that the standard Char
has a noticeable performance impact for some particular use cases.