Problems with deprecations of islower, lowercase, isupper, uppercase

The problem is thinking that if you have a hammer, every problem in the world is a nail :slight_smile:

UTF-8 is exactly the right solution for things like Web pages, data transfer, and text in parts of the world with languages where characters outside the ASCII range are the exception rather than the rule.
However, it is not good at all when you are doing a lot of text processing, or for storing text written in the languages used by roughly three quarters of the world’s population. Doubling the storage required, by moving from something like SJIS or GB to UTF-8, is simply not acceptable to many customers, and rightly so! (That was a major competitive advantage we at InterSystems had over other database vendors, who tried to push UTF-16 or UTF-8 on their Asian customers: I had a compaction scheme for Unicode that stored the text in even less space than SJIS, instead of incurring the roughly 33% penalty of UTF-16 (1.5 bytes per character on average → 2 bytes), or the roughly 100% penalty of UTF-8 (1.5 → 3 bytes).)
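To make the arithmetic concrete, the byte counts are easy to check in Julia (the 1.5 bytes/character SJIS figure is an average for typical mixed Japanese text):

```julia
s = "日本語"                      # 3 characters
sizeof(s)                         # 9 bytes as UTF-8 (3 bytes per CJK char)
sizeof(transcode(UInt16, s))      # 6 bytes as UTF-16 (2 bytes per char)
# SJIS also stores each of these characters in 2 bytes, but mixed Japanese
# text averages roughly 1.5 bytes/char in SJIS, hence the ~33% (UTF-16)
# and ~100% (UTF-8) penalties mentioned above.
```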

I don’t believe that I’m distorting it. People who deal with this sort of thing already have these tables, based on the Unicode code points. Do you really think that, to claw back the performance lost, they’d have to redo all their tables using Stefan’s rather complicated encoding, which would then only be useful for Julia? Using standard Unicode code points, those tables can simply be stored in a compiled library, shared and used by all programs on a system.

The reason I brought that up, along with the W3C’s recommendations, is that the default upper/lower/titlecase mappings for Unicode are not fixed and cast in stone, and they are not even recommended for user applications, which really should be locale specific. That means dealing with locale-specific tables, constantly going back and forth between Char and UInt32 (a conversion which, until #24999, was a no-op).
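As a tiny illustration (the tr_upper table here is a hypothetical stand-in for a real Turkish locale table, not any existing API):

```julia
# Locale tables key on raw code points, so every character round-trips
# through UInt32 on the way into and out of the lookup.
const tr_upper = Dict{UInt32,UInt32}(UInt32('i') => UInt32('İ'))

locale_uppercase(c::Char) =
    Char(get(tr_upper, UInt32(c), UInt32(uppercase(c))))

locale_uppercase('i')   # 'İ' (U+0130), per Turkish casing rules
```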

This comes down to the principle of “Garbage In, Garbage Out”, and to Stefan’s philosophy about handling text data (which might have been OK somewhere like a web retailer such as Etsy, but doesn’t fly at all when people’s lives or livelihoods are at stake, as with the sorts of medical and financial applications written in the language I was responsible for).
The problem isn’t UTF-8 at all; it is allowing misidentified data to be input and not detecting that at the first point it is encountered. Often, unless you have the entire file read in, you won’t have enough information to auto-detect the encoding. If you store the data away anyway, possibly altering sequences you thought were “invalid UTF-8”, you may not realize until the original data can no longer be recovered that it really was something like CP-1252. Then, when that possibly critical data is needed, you find out it’s unusable garbage,
because you have no way of telling what the character set really was.
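A small example of how easily this happens (the bytes are real CP-1252; the scenario is illustrative):

```julia
bytes = [0x63, 0x61, 0x66, 0xe9]   # "café" encoded in CP-1252 / Latin-1
isvalid(String(copy(bytes)))       # false: the 0xe9 is not valid UTF-8 here
# If those bytes get "fixed up" on input as if they were broken UTF-8,
# nothing in the stored data records that they were really CP-1252.
```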

Stefan wants to avoid errors from what he thinks of as “invalid data”, but, in my experience (and as he should remember from a number of incidents over the last couple of years, where people had problems because Julia simply assumed that everything was UTF-8), what developers need are good tools for dealing with these issues, such as ways to:

  1. make sure that all immutable strings are always valid (which can also give major performance advantages when processing)

  2. have a reasonable way of dealing with input in variant encodings, such as the variants of UTF-8 caused by overlong encodings (as Java produces for \0 in its Modified UTF-8), or by encoding two UTF-16 surrogate characters as two 3-byte UTF-8 sequences instead of the correct single 4-byte sequence, which means the result will not sort or hash correctly unless you produce valid UTF-8 (see the first sketch after this list)

  3. allow, as much as possible, auto-detection of the most common character sets, and make it easy to call external tools that can do an even better job of character-set detection; also detect things like a UTF-8 BOM, 16-bit and 32-bit BOMs (including byte-swapped ones), as well as UTF-16 simply widened to 32 bits (this does happen!)

  4. have both “safe” conversions, which give an error, and “unsafe” conversions, which may return a raw vector of bytes, 16-bit words, or 32-bit words when the input is not valid for the character set/encoding it was identified as; also give the option of replacing invalid sequences with zero or more characters, or even of passing a function to be called for each detected sequence of invalid code units, which can either raise an error or return zero or more code units (or possibly code points) to insert in the output in place of the invalid sequence (see the second sketch after this list)
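Here is the first sketch, for point 2 (combine_surrogates is an illustrative helper, not a Base or proposed API):

```julia
# CESU-8-style data encodes a UTF-16 surrogate pair as two 3-byte UTF-8
# sequences; recombining the pair gives the real code point, which then
# re-encodes as one valid 4-byte UTF-8 sequence.
function combine_surrogates(lead::UInt32, trail::UInt32)
    @assert 0xd800 <= lead <= 0xdbff && 0xdc00 <= trail <= 0xdfff
    0x10000 + ((lead - 0xd800) << 10) + (trail - 0xdc00)
end

combine_surrogates(0x0000d83d, 0x0000de00) == 0x1f600   # true: U+1F600 '😀'
```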
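And the second sketch, for point 4 (convert_checked is a hypothetical name; this only illustrates the callback idea):

```julia
# The caller decides, per invalid sequence, whether to raise an error or
# substitute zero or more replacement characters.
function convert_checked(bytes::Vector{UInt8};
                         onerror = c -> error("invalid sequence: $(repr(c))"))
    out = IOBuffer()
    for c in String(copy(bytes))    # a String can carry invalid data
        isvalid(c) ? print(out, c) : foreach(r -> print(out, r), onerror(c))
    end
    String(take!(out))
end

# convert_checked([0x61, 0xff, 0x62])                       # "safe" use: throws
convert_checked([0x61, 0xff, 0x62]; onerror = _ -> ('\ufffd',))   # "a\ufffdb"
```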

I was surprised that people decided on such a non-generic (to me, non-Julian) approach here.
If you have a numeric type and an operation such as ‘+’, do you have to call ‘UInt8_add’, ‘Dec128_add’, or ‘BigInt_add’ to perform it? Does it matter that the result of a + b depends on whether a and b are UInt8, Dec128, or BigInt? Not at all: 0x80 + 0x80 == 0x00, but 128 + 128 == 256.
So why should somebody have to load a specific “Unicode” package, and then call functions specific to that package, just to perform a generic operation such as ‘uppercase’?
So if I have a Str{:SJIS} string and perform an operation such as uppercase on it, that should just work, without my having to worry about Unicode at all. Other standards, such as GB 18030, have their own ideas of default mapping tables, and those should be respected.
Base.Char definitely should respect Unicode (and with #24999, it really doesn’t anymore, because a Char can now hold any sort of invalid value). A Chr{:SJIS}, though, representing SJIS code points, should respect that standard, just as a code-point type for GB 18030 strings should respect GB 18030.
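Here is roughly the kind of dispatch I mean (Chr and sjis_upper are illustrative stand-ins, not the final Strs.jl API):

```julia
struct Chr{CS}
    v::UInt32
end

# Tiny ASCII-only stand-in for a real SJIS case-mapping table:
sjis_upper(v::UInt32) = UInt32('a') <= v <= UInt32('z') ? v - 0x20 : v

# One generic `uppercase`; each character set adds a method backed by its
# own standard's tables, and callers never have to name the standard.
Base.uppercase(c::Chr{:SJIS}) = Chr{:SJIS}(sjis_upper(c.v))

uppercase(Chr{:SJIS}(UInt32('a'))).v == UInt32('A')   # true
```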

These deprecations of islower, lowercase, etc. are also causing a lot of churn in code, such as DecFP.jl, that was simply trying to deal with ASCII characters.

It also doesn’t make sense to me that, if Julia itself uses Unicode tables for identifiers and for deciding whether something is upper, lower, alphabetic, numeric, etc., and Char is defined as representing a Unicode code point, and String as logically a collection of Chars, those operations aren’t simply part of the Base language.
I realize that people are concerned about tying the language to a particular version of the Unicode standard, but making people do using Unicode for things that are already present (but now hidden) in Julia is not, I believe, the best way of handling that concern.
Instead, I believe the tables could be autogenerated from the Unicode data plus Julia’s exceptions (for example, for identifiers) and compiled into a shared library, instead of having so much (such as which characters are valid in identifiers) hard-coded into utf8proc and the FemtoLisp parser, and other behavior (the LaTeX and Emoji tab completions) hard-coded into the REPL.
That is the simple technique I used to generate the Unicode, HTML, Emoji, and LaTeX entity tables that give StringLiterals the ability to handle things like “<dagger>” or “:smile:” in my extended escape sequences; a rough sketch of the generation step is below.
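A rough sketch of that generation step, assuming a local copy of UnicodeData.txt (in which 1-based field 13 is the simple uppercase mapping; all names and paths here are illustrative):

```julia
# Parse UnicodeData.txt and emit a Julia source file containing the
# uppercase-mapping table, ready to be compiled into a package/library.
function gen_upper_table(unicodedata = "UnicodeData.txt",
                         out = "upper_table.jl")
    open(out, "w") do io
        println(io, "const UPPER_MAP = Dict{UInt32,UInt32}([")
        for line in eachline(unicodedata)
            f = split(line, ';')
            length(f) >= 13 && !isempty(f[13]) || continue
            cp = parse(UInt32, f[1], base = 16)
            up = parse(UInt32, f[13], base = 16)
            println(io, "    (0x", string(cp, base = 16),
                        ", 0x", string(up, base = 16), "),")
        end
        println(io, "])")
    end
end
```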

If you want to discuss this further, I’d be more than happy to - I know you are supposed to be on vacation now, so maybe afterwards (by then, I should have my Strs.jl package well tested with lots of benchmark data, pushed to GitHub).