Within a week … is Christmas. Julia needs to release 1.0; I cannot believe that now, under this pressure, it is possible or advisable to convince people that the current string architecture must be changed (again). Isn’t there a compromise possible, with a unified UTF-8 String for mere mortals, where Unicode must be imported, but you can still do your fast string / char magic? And later, in 1.x, possible small adjustments can be discussed in a more relaxed manner?
[It hurts a bit to read this thread. With all due respect (you’re a 100x better programmer than me), shouldn’t one stop arguing at some point? There must be dozens of difficult issues, not all of which can be solved or even discussed by a rather small core developer group.]
I think this is an important point which does not (yet?) get the emphasis it deserves in the manual, which starts with a lot of examples of codepoint conversions, and does not mention loops at all. Is that going to come later in a separate PR?
Yes, absolutely. I’m going to do a big update to the string section after feature freeze. Unfortunately, I haven’t had the time to do that yet, but it will definitely make this point and explain the (clarified) abstract string API.
I’m afraid that recommendations like that in the manual will definitely prove to people knowledgeable about text processing that Julia is not going to be a good language for them.
It’s perfectly possible to implement Unicode character classification based directly on the UTF-8 character encoding, without computing the codepoint as an integer. (At the moment, character classification in Julia does involve converting to an integer codepoint value, but this is an implementation detail that could be changed.)
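As a rough illustration of deciding something about a character straight from its encoded bytes, here is a minimal sketch (this is not how Base does it). It relies on the fact that the Cyrillic block U+0400–U+04FF is exactly the set of two-byte UTF-8 sequences whose lead byte is 0xD0–0xD3, so block membership can be read off one byte. The name `in_cyrillic_block` is made up for this example, and the string is assumed to be valid UTF-8.

```julia
# Hypothetical helper: does the character starting at byte index `i` of `s`
# lie in the Cyrillic block U+0400..U+04FF? Assumes `s` is valid UTF-8 and
# `i` is the start of a character. Only the lead byte is inspected; no
# integer codepoint is ever computed.
function in_cyrillic_block(s::String, i::Integer)
    lead = codeunit(s, i)
    return 0xd0 <= lead <= 0xd3
end

in_cyrillic_block("Жук", 1)    # true
in_cyrillic_block("zebra", 1)  # false
```

Real classification tables (letter, digit, and so on) are of course much larger, but the same idea of indexing on the encoded bytes applies.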
The ISO-C character-classification functions are not particularly useful for this task: even supposedly Unicode-aware functions like iswdigit are useless because (a) they are BMP-only on Windows and (b) their return value is not portable, depending on operating system Unicode tables which are usually out of date compared to the tables in utf8proc that are used by Julia. That’s why we ended up implementing our own isdigit etcetera in Julia.
Note also that there are plenty of circumstances in which direct character comparisons are appropriate, e.g. when parsing numeric strings into numbers you commonly only want to accept ASCII digits, i.e. you want '0' ≤ c ≤ '9'.
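For instance, a number parser that only accepts ASCII digits could look like the following sketch. The function name `parse_ascii_uint` is hypothetical, and overflow checking is left out to keep it short.

```julia
# Hypothetical helper: parse a non-negative decimal integer, accepting only
# the ASCII digits '0'..'9'. Returns `nothing` for an empty string or if any
# other character appears (including non-ASCII digits such as U+0663).
# Overflow is not checked in this sketch.
function parse_ascii_uint(s::AbstractString)
    isempty(s) && return nothing
    n = 0
    for c in s
        '0' <= c <= '9' || return nothing
        n = 10n + (c - '0')   # Char - Char gives the Int difference
    end
    return n
end

parse_ascii_uint("2018")       # 2018
parse_ascii_uint("12\u0663")   # nothing (ARABIC-INDIC DIGIT THREE is rejected)
```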
It is frustrating, Scott, that you so commonly equate “not implementing things the way I prefer” with “not taking string-processing seriously”.
You seem to think that those tables are something fixed.
First off, they are locale-specific, and cannot necessarily even be handled by a one-to-one mapping, even in Unicode. The German ß character is a good example, as are the Turkic languages, which pair lowercase i with dotted uppercase İ, and dotless lowercase ı with uppercase I.
You should probably read the following to get a better idea: Case folding - Internationalization
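To make the point concrete, here is a minimal sketch of table-driven uppercasing with per-locale exceptions. The table names and `uppercase_with` are invented for illustration, and the tables contain only the cases mentioned above, not a complete locale definition.

```julia
# Hypothetical per-locale tables of exceptions to the default mapping.
const GERMAN_UPPER  = Dict('ß' => "SS")              # one character becomes two
const TURKISH_UPPER = Dict('i' => "İ", 'ı' => "I")   # differs from the default i => I

# Uppercase `s`, consulting the exception table first and falling back to the
# locale-independent Base `uppercase` for every other character.
function uppercase_with(table::Dict{Char,String}, s::AbstractString)
    io = IOBuffer()
    for c in s
        print(io, get(table, c, string(uppercase(c))))
    end
    return String(take!(io))
end

uppercase_with(GERMAN_UPPER, "straße")     # "STRASSE"
uppercase_with(TURKISH_UPPER, "istanbul")  # "İSTANBUL"
```

Note that the result can have a different length than the input, which is exactly why a Char-to-Char table is not enough.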
It is very frustrating when members of the Julia core team don’t have the requisite knowledge in this domain and, instead of investigating the issues in the domain, prefer to believe that this is just about what I prefer, and make ad hominem comments.
I had to deal with all of these issues decades ago because paying customers demanded them, for critical applications where people’s lives depend on things like the doctor’s notes not turning into unreadable garbage because some application developer didn’t understand the issues of different character sets and chose an approach like Stefan’s, of just assuming what the character set was (even though they had all the tools to handle the conversions correctly).
Stefan keeps bringing up how much of the Internet now uses UTF-8, but with some probably incorrect numbers. The 90% figure didn’t come from Google, but rather from another company that looked at websites, not webpages, and only at what they considered to be the “top” websites, based on a ranking by Amazon Alexa. They would count a site as being “UTF-8” if one page on the site was UTF-8, even if all the others were ASCII (as Google stated in their study, many pages are marked as UTF-8 that really only contain ASCII). The real answer, using the criteria Google used, of counting webpages and checking for ASCII-only pages, is probably still a good bit less than 90% (I’d bet somewhere around 75-80%).
However, that is only a statistic of what is seen when looking at web pages, whose content is usually generated dynamically based on data from a database. That’s where you really need to look to see which character sets are more important to handle when doing text processing, and there you’ll find a lot of ASCII, ANSI Latin 1, MS CP 1252, GB 2312, GB 18030, Big5, and UTF-16, as well as UTF-8 (in the places where it doesn’t blow up the size of the data).
UTF-8 is great for web pages; however, it’s very bad for text processing, particularly outside of Western Europe. There are good reasons why languages such as Java, Objective-C, and Swift (and many libraries, such as ICU, for C, C++, Go, Rust, etc.) use UTF-16 for that.
When I added Unicode (1.0) support to InterSystems’ language/database (now known as Caché ObjectScript, based on ANSI Standard M/Mumps), back in the mid-90s, even UTF-16 was not acceptable for the Asian market, because data that was encoded in their national MBCS, such as S-JIS in Japan, generally took less than 1.5 bytes per character for the types of data stored in their databases, i.e. records with mixed numeric and text fields (and no more than 2 bytes, for things like text documents). I ended up inventing a scheme for packing Unicode strings such that pure ASCII and Latin1 strings ended up being 1 byte per character, sequences from other alphabetic languages also ended up being 1 byte per character (or less!), and Japanese, for example, was pretty much always shorter than S-JIS (due to efficiently encoding sequences within a particular 128-byte Unicode page, sequences of digits, and repeated characters). That was not used for processing, just for storage in the database.
People who didn’t want to use even UTF-16 for their data, because it frequently caused a 33% size penalty over S-JIS, would never want to use UTF-8, which gives a 100% penalty.
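Anyone who wants to check the size trade-off on their own data can do so with a few lines. This is only a sketch: the helper names and sample strings are mine, and since Base has no Shift-JIS encoder, the S-JIS figures in the comments are computed by hand (2 bytes per kanji or full-width kana, 1 byte per ASCII character).

```julia
# Bytes needed to store `s` as UTF-8 (Julia's native String) and as UTF-16.
utf8_bytes(s::AbstractString)  = sizeof(String(s))
utf16_bytes(s::AbstractString) = 2 * length(transcode(UInt16, String(s)))

text  = "日本語テキスト"   # 7 characters, all kanji/kana in the BMP
mixed = "価格123"          # 2 kanji followed by 3 ASCII digits

utf8_bytes(text),  utf16_bytes(text)    # (21, 14); S-JIS would be 14
utf8_bytes(mixed), utf16_bytes(mixed)   # (9, 10);  S-JIS would be 7
```

How the penalty works out in practice depends heavily on the ratio of ASCII to CJK characters in the records, which is why measuring real data matters more than any single example.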
Many customers also extensively use different locales, or even custom locales, with mapping for IO translations, upper/lower/titlecase etc. mappings, categorization (upper/lower/title/alphabetic/numeric/identifier/identifierstart/print/graph/punct/etc). In addition, to seriously support text processing, you need to be able to handle defining those tables even for custom character sets (there are many somewhat standardized ones, using the Private Use Areas, often for things that are trying to get the characters added to the Unicode standard).
Should Julia ignore the official character set of the world’s most populous country, China, i.e. GB18030 (GB 18030 - Wikipedia)?
If you read that, you can see how, in their official character set, there are still a number of characters which are currently mapped to PUA codepoints.
Maybe now you can understand why the idea of having separate namespaces for generic functions such as isalpha/uppercase/lowercase makes no sense at all?
I agree, until you potentially do uppercase on µ. Would it matter to you if you could get fast direct indexing on UTF-8 strings? I had an idea similar to Python’s, where you store a count of the leading ASCII characters (to be UTF-8 compatible, or potentially a CP-1252 prefix count) as a prefix of your string. [And then maybe a further count of UTF-16.]
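A minimal sketch of the ASCII-prefix idea, as I understand it, might look like this. `PrefixCounted` and `nthchar` are hypothetical names, and here the count lives in a wrapper struct rather than in the string’s own header.

```julia
# Hypothetical wrapper: a UTF-8 string plus the length of its leading
# all-ASCII run, computed once at construction.
struct PrefixCounted
    data::String
    ascii_prefix::Int
end

function PrefixCounted(s::String)
    n = 0
    for b in codeunits(s)
        b < 0x80 || break      # stop at the first non-ASCII byte
        n += 1
    end
    return PrefixCounted(s, n)
end

# Return the i-th character (1-based, counting characters, not bytes).
# Throws for an out-of-range `i`.
function nthchar(p::PrefixCounted, i::Integer)
    if 1 <= i <= p.ascii_prefix
        # Inside the ASCII prefix, character index == byte index: O(1).
        return Char(codeunit(p.data, i))
    end
    # Past the prefix: step forward character by character: O(n).
    return p.data[nextind(p.data, p.ascii_prefix, i - p.ascii_prefix)]
end

p = PrefixCounted("abcé日")
p.ascii_prefix   # 3
nthchar(p, 2)    # 'b', found directly
nthchar(p, 5)    # '日', found by scanning from the end of the prefix
```

This only helps for strings that start with a long ASCII run; anything after the first non-ASCII character still needs a scan or an additional index.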
I forgot to mention a few other twists (had to help wife with misbehaving iPhone, had to drive 45 minutes to get a Genius bar appointment on the same day!).
Besides having to deal with one-to-one, many-to-many, one-to-many, and many-to-one transformation tables (for example, in a German locale uppercase("ß") should return "SS", and lowercase("SS") return "ß"), correctly handling collation is a lot more complicated. You need to be able to handle different types of sorting, even in a single locale. Take, for example, French, with the following four words:
cote (rating)
coté (highly regarded)
côte (coast)
côté (side)
Many times, you want to have a case-insensitive sort, as well as a case-sensitive one.
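As a rough sketch of why those four words are awkward: traditional French dictionary ordering breaks ties on accents starting from the end of the word, whereas a naive comparison works from the front. The helper names below are invented, the accent weights are arbitrary, and this ignores case and everything else a real collation table handles.

```julia
using Unicode

# Primary key: the word with diacritics removed.
strip_accents(s::AbstractString) = Unicode.normalize(s, stripmark=true)

# Secondary key: one weight per character, 0 if unaccented, otherwise the
# codepoint (an arbitrary but consistent weight for this sketch).
function accent_weights(s::AbstractString)
    return [strip_accents(string(c)) == string(c) ? 0 : Int(c)
            for c in Unicode.normalize(s, :NFC)]
end

# Accents compared front to back, versus back to front (French style).
forward_key(s) = (strip_accents(s), accent_weights(s))
french_key(s)  = (strip_accents(s), reverse(accent_weights(s)))

words = ["côté", "coté", "côte", "cote"]
sort(words, by = forward_key)  # ["cote", "coté", "côte", "côté"]
sort(words, by = french_key)   # ["cote", "côte", "coté", "côté"]
```

Same locale, same words, two defensible orders, and that is before case sensitivity comes into it.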
There are just so many issues, it really does take a long time to gain experience with all the different ones.
If there’s an issue about linear algebra, sure, I’d ask somebody like Steven or Stefan, or my professor for 18.06, Gilbert Strang, and trust in their areas of expertise.
I wish that people in the core group would at least consider the possibility that this is an area where I do have a significant amount of rather relevant expertise.
If you ask for an uppercase operation within the character set, then if you have a string using the CP-1252 character set (don’t confuse with encoding!), say of type Str(enc"CP1252"), the result will be that ‘µ’ is not changed, the same as for an uppercase operation on the Latin-1 character ‘ÿ’ in a Str(enc"Latin1").
Internally, with my new string package, I actually have two Latin-1 types that differ only in the way uppercase is handled. One, ‘LatinStr’, is for actually processing ISO 8859-1 strings as such, and the other, ‘LatinUStr’, is for the internal optimization of storing Unicode strings consisting only of characters <= 0xff as bytes, which saves significant space compared to UTF-8 encoding for a number of Eastern European languages.
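To show the two ideas in plain Base Julia (this is a sketch, not the package’s actual code, and all three function names are hypothetical): storing a string one byte per character when everything fits in 0x00–0xff, and an uppercase that stays inside that character set.

```julia
# Pack a string into one byte per character if every character is <= U+00FF;
# return `nothing` otherwise.
function pack_latin1(s::AbstractString)
    all(c -> c <= '\u00ff', s) || return nothing
    return UInt8[UInt8(c) for c in s]
end

# Rebuild a regular (UTF-8) String from the packed bytes.
unpack_latin1(bytes::Vector{UInt8}) = join(Char(b) for b in bytes)

# Uppercase that stays inside the Latin-1 repertoire: if the Unicode uppercase
# form is above U+00FF (as for 'ÿ' and 'µ'), the character is left unchanged.
function latin1_upper(c::Char)
    u = uppercase(c)
    return u <= '\u00ff' ? u : c
end

bytes = pack_latin1("Grüße")   # 5 bytes, versus sizeof("Grüße") == 7 in UTF-8
unpack_latin1(bytes)           # "Grüße"
latin1_upper('é')              # 'É'
latin1_upper('\u00ff')         # 'ÿ', unchanged
```

A Unicode-semantics variant would instead have to widen the storage whenever uppercasing leaves the 0x00–0xff range.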
With the boom of deep natural language processing and the large corpora of text that the machine learning community is handling, I don’t think this is a good way forward. Every performance gain matters in the end.
BTW, I never said that those functions should be used. I referenced them solely for the discussion of the history, and of why people moved away from hard-coded character comparisons (which are a bad idea) to locale-based tables.
I architected and implemented support for handling all of that, including ways for customers to easily set up their own I/O translation, lower/upper/identifier/etc. tables, upper/lower/title conversion tables, X/Y tables (i.e. character widths and whether a character moved the cursor, like \b, \t, \r, or \n), and collation tables, for single-byte, multibyte, and wide character sets, and later for Unicode.
Here is a link to the documentation for the customer-facing utilities for handling those tables: Customizing the Caché System | Caché Specialized System Tools and Utilities | Caché & Ensemble 2018.1.4 – 2018.1.8
But done is also better than perfect. Julia can be a stable language with a growing community, or a never-ending work in progress that never gets traction.
OTOH, having a good design at the start can mean the difference between something that can grow in the future, or something that is forever limited by the bad decisions of the past.
At least currently, String has some special advantages, because it is implemented in the system.
Also, why have something in base that is less performant, takes 50% more space per character for the languages of most of the world, is harder to use, and leads to problems with “Garbage In, Garbage Out”?
I believe Julia deserves to have best in class string handling, don’t you?
Can you be more specific? My impression is that the whole of string handling is written in Julia, and uses facilities of the language that are also available to libraries.
I think that the key reason for the various disagreements you brought up in this topic is that your ideas on what is “best” differ from other people’s.
But this is OK. I think that the best way to argue that your approach is “better” is not to disparage the implementation in Base, but to implement a library and wait for people to flock to it. If that happens, you would be in a strong position to argue that your way is better.
Unfortunately, that’s not true. String is special: strings are created with the following function (and a few other similar ones for optimization purposes):
function String(v::Array{UInt8,1})
    ccall(:jl_array_to_string, Ref{String}, (Any,), v)
end
That goes through C code, so that referencing the string type does not have an extra indirection, unlike what you can do in Julia:
struct Str{T} <: AbstractString ; data::Vector{UInt8} ; end
If you know of a way in Julia (best if supported) of getting my Str{T} type instead of jl_string_type,
which is hard-coded into the C code that String uses, then maybe I could eliminate that very serious impact on performance, both in speed and memory (because it has to allocate 16 bytes just to hold the Str{T} type and the pointer to the Vector{UInt8}, and has to follow the pointer on accesses…)