Problems with deprecations of islower, lowercase, isupper, uppercase

You seem to think that those tables are something fixed.
First off, they are locale-specific, and cannot necessarily even be handled by a one-to-one mapping, even in Unicode: the German ß is a good example, as are the Turkic languages’ dotted lowercase i and uppercase İ versus dotless lowercase ı and uppercase I.
You should probably read the following to get a better idea: Case folding - Internationalization
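To make the point concrete, here is a small Julia sketch of my own (not anything from Base; `uppercase_in` is purely illustrative) showing why a single fixed one-to-one table cannot express these mappings:

```julia
# Facts from the Unicode standard, independent of any particular API:
#
#   U+00DF 'ß':  its full uppercase form is the two-character string "SS",
#                so no character-to-character table can express it.
#   U+0069 'i':  uppercases to 'I' in most locales, but to 'İ' (U+0130)
#                in Turkish and Azerbaijani.
#   U+0049 'I':  lowercases to 'i' in most locales, but to 'ı' (U+0131)
#                in Turkish and Azerbaijani.
#
# A locale-aware mapping therefore needs a locale argument, and its result
# may be a whole string rather than a single character.
function uppercase_in(c::Char, locale::Symbol)
    if locale in (:tr, :az) && c == 'i'
        return 'İ'          # dotted capital I, the correct answer in Turkic locales
    elseif c == 'ß'
        return "SS"         # one-to-many: the result is a String, not a Char
    else
        return uppercase(c) # fall back to the default (root-locale) mapping
    end
end
```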

It is very frustrating when members of the Julia core team don’t have the requisite knowledge in this domain and, instead of investigating the issues, prefer to believe that this is just about my personal preferences and make ad hominem comments.

I had to deal with all of these issues decades ago because paying customers demanded them for critical applications, where people’s lives depend on things like a doctor’s notes not being turned into unreadable garbage because some application developer didn’t understand the issues of different character sets and chose an approach like Stefan’s of just assuming what the character set was (even though they had all the tools to handle the conversions correctly).

Stefan keeps bringing up how much of the Internet now uses UTF-8, but with some probably incorrect numbers. The 90% figure didn’t come from Google, but rather from another company that looked at websites, not web pages, and only at what they considered to be the “top” websites, based on a ranking by Amazon Alexa; they would count a site as being “UTF-8” if one page on the site was UTF-8, even if all the others were ASCII (as Google stated in their study, many pages are marked as UTF-8 that really only contain ASCII). The real answer, using the criteria Google used (counting web pages and checking for ASCII-only pages), is probably still a good bit less than 90% (I’d bet somewhere around 75–80%).
However, that is only a statistic of what is seen when looking at web pages, whose content is usually generated dynamically from data in a database. The databases are where you really need to look to see which character sets are most important to handle when doing text processing, and there you’ll find a lot of ASCII, ANSI Latin-1, MS CP1252, GB 2312, GB 18030, Big5, UTF-16, as well as UTF-8 (in the places where it doesn’t blow up the size of the data).
UTF-8 is great for web pages; however, it’s very bad for text processing, particularly outside of Western Europe. There are good reasons why languages such as Java, Objective-C, and Swift (and many libraries, such as ICU for C, C++, Go, and Rust) use UTF-16 for that.
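For example, a quick check with Base’s `transcode` shows the size difference for a short Japanese string (the particular text is just an illustration):

```julia
# The same Japanese text in the two encodings (all characters are in the BMP):
s = "日本語のテキスト"               # 8 characters

sizeof(s)                            # UTF-8:  3 bytes per character here, 24 bytes total
2 * length(transcode(UInt16, s))     # UTF-16: 2 bytes per character here, 16 bytes total
```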

When I added Unicode (1.0) support to InterSystems’ language/database (now known as Caché ObjectScript, based on the ANSI standard M/MUMPS) back in the mid-’90s, even UTF-16 was not acceptable for the Asian market, because data encoded in national MBCS encodings such as S-JIS in Japan generally took less than 1.5 bytes per character for the types of data stored in their databases, i.e. records with mixed numeric and text fields (and no more than 2 bytes per character for things like text documents). I ended up inventing a scheme for packing Unicode strings such that pure ASCII and Latin-1 strings ended up being 1 byte per character, sequences from other alphabetic languages also ended up being 1 byte per character (or less!), and Japanese, for example, was pretty much always shorter than S-JIS (due to efficiently encoding sequences within particular 128-character Unicode pages, sequences of digits, and repeated characters). (That was not used for processing, just for storage in the database.)
People who didn’t want to use even UTF-16 for their data, because it frequently caused a 33% size penalty over S-JIS, would never want to use UTF-8, which gives a 100% penalty.
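For reference, the rough arithmetic behind those penalty figures, taking the ≈1.5 bytes/character S-JIS average for mixed records from the paragraph above as a given (the per-character sizes are encoding facts; how closely they apply to any particular data set obviously varies):

```julia
# Approximate bytes per character for the kinds of data described above:
#
#                       S-JIS   UTF-16   UTF-8
#   ASCII digit/letter    1        2       1
#   kana / kanji          2        2       3
#
# Against the ≈1.5 bytes/character that such mixed records averaged in S-JIS,
# UTF-16's flat 2 bytes per BMP character is the ≈33% penalty, and the 3-byte
# UTF-8 encoding of kana/kanji is where the ≈100% penalty comes from.
sjis_avg, utf16_per_char, utf8_japanese = 1.5, 2.0, 3.0

utf16_per_char / sjis_avg - 1    # ≈ 0.33
utf8_japanese  / sjis_avg - 1    # = 1.0
```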

Many customers also make extensive use of different locales, or even custom locales, with mappings for I/O translation, upper/lower/titlecase conversion, and categorization (upper/lower/title/alphabetic/numeric/identifier/identifier-start/print/graph/punct/etc.). In addition, to seriously support text processing, you need to be able to define those tables even for custom character sets (there are many somewhat standardized ones, using the Private Use Areas, often for characters that people are trying to get added to the Unicode standard).
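As a tiny illustration of the kind of hook this requires (the PUA ranges below are standard; `is_private_use`, `custom_isalpha`, and the override table are hypothetical names of my own, not any real API):

```julia
# The Unicode Private Use Areas: U+E000..U+F8FF in the BMP, plus all of
# planes 15 and 16 (U+F0000..U+FFFFD and U+100000..U+10FFFD).
is_private_use(c::Char) =
    '\ue000' <= c <= '\uf8ff' ||
    '\U000F0000' <= c <= '\U000FFFFD' ||
    '\U00100000' <= c <= '\U0010FFFD'

# A vendor shipping a custom character set in the PUA needs to be able to
# extend categorization (and case mappings) for exactly those code points,
# something a single hard-wired table cannot provide.
const custom_alpha = Set{Char}()    # would be filled from the vendor's tables
custom_isalpha(c::Char) = isletter(c) || (is_private_use(c) && c in custom_alpha)
```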
Should Julia ignore the official character set of the world’s most populous country, China, i.e. GB18030 (GB 18030 - Wikipedia)?
If you read that, you can see that, in their official character set, there are still a number of characters which are currently mapped to PUA code points.
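For anyone who wants to experiment with it from Julia, the (assumed here) StringEncodings.jl package, which wraps iconv, can convert to and from GB 18030; whether a given character round-trips through a PUA code point or a regular assignment depends on the mapping tables in the iconv build:

```julia
using StringEncodings       # registered package wrapping iconv; not part of Base

text  = "中文"                       # two CJK characters
bytes = encode(text, "GB18030")      # re-encode the UTF-8 String as GB 18030
back  = decode(bytes, "GB18030")     # and decode it back into a Julia String

@assert back == text
```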
Maybe now you can understand why the idea of having separate namespaces for generic functions such as isalpha/uppercase/lowercase makes no sense at all?
