Problems with deprecations of islower, lowercase, isupper, uppercase

Yes, that would be a breaking change. But we already tried this for two years and it had really poor usability. The trouble is that you want to read data in and then be able to check if it’s valid or not. If String isn’t allowed to hold invalid data, then you have to do everything with byte vectors first and only deal with strings once you’re sure you’ve got clean data. This leads to a perverse situation where packages that do heavy string processing like CSV and TextParse have to avoid using the String type in order to be robust.

In the new design, you can work with Strings that wrap invalid data no problem. If you want to know if a string or character is valid, just call Unicode.isvalid on it. If you want to replace invalid characters, just do this:

Unicode.isvalid(c) || (c = '\ufffd')

If you want to ignore invalid characters, just do

Unicode.isvalid(c) || continue

If you want to raise an error… you get the idea. It’s simple, easy to understand, and doesn’t impose a validity or transcoding tax on you unless you want it. There are some operations that do throw errors for invalid data. If you try to convert an invalid character to a code point value, for example, that’s an error – since there’s no well-defined answer.
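To make the three options concrete, here is a minimal sketch. It uses plain `isvalid`, which is how the check is spelled in Base in released Julia 1.x. The helper names `replace_invalid`, `drop_invalid` and `try_codepoint` are made up for illustration and are not part of Base.

```julia
# A String wrapping invalid UTF-8: the byte 0xff never occurs in valid UTF-8.
s = String(UInt8[0x61, 0xff, 0x62])

# Replace each invalid character with U+FFFD, the replacement character.
function replace_invalid(s::AbstractString)
    io = IOBuffer()
    for c in s
        isvalid(c) || (c = '\ufffd')
        print(io, c)
    end
    return String(take!(io))
end

# Ignore invalid characters entirely.
function drop_invalid(s::AbstractString)
    io = IOBuffer()
    for c in s
        isvalid(c) || continue
        print(io, c)
    end
    return String(take!(io))
end

# Converting an invalid character to a code point throws, because there is
# no well-defined answer; this hypothetical helper returns `nothing` instead.
function try_codepoint(c::AbstractChar)
    try
        return codepoint(c)
    catch
        return nothing
    end
end
```

Constructing and iterating `s` never throws. Only the explicit code point conversion does, and only for the invalid character.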

I’m conflicted here, would we want garbage in, and “same content” (or exactly same, say when implementing cat) garbage out, instead of the exception?

Like I said, we tried it @ScottPJones’s way for two years and it has truly awful ergonomics for real-world data processing – this is not hypothetical or just my opinion, the data ecosystem has been struggling with it badly. Fundamentally, you need a string type that lets you choose whether to throw an error or not. You can wrap that in a stricter string type that always validates, but you can’t implement the relaxed version in terms of the strict one. So Base Julia gives you the more general, non-strict version and leaves strict Unicode enforcement to packages.
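As a rough sketch of that layering, a strict type can be written as a thin wrapper that validates once at construction. The name `StrictString` is hypothetical and not an existing Base or package type. A real package would presumably also subtype `AbstractString` and implement the string interface.

```julia
# Hypothetical strict wrapper: holds a String that is guaranteed to be valid.
struct StrictString
    data::String
    function StrictString(s::String)
        # Validate once, up front; throw if the data is not valid UTF-8.
        isvalid(s) || throw(ArgumentError("invalid UTF-8 data"))
        return new(s)
    end
end

good = StrictString("héllo")                   # passes validation
bad  = String(UInt8[0x61, 0xff, 0x62])         # a plain String holds this fine
```

Calling `StrictString(bad)` throws, while `bad` itself remains a perfectly usable `String`. The reverse direction has no equivalent: once a type refuses invalid data, nothing built on top of it can hold that data.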

Seems bad, but please don’t add an (explicit, not subtype) ASCII type, or any legacy-1-byte encoding to Base. It might be too tempting for people to use, locking your code out of Unicode. Rather have UTF-8, with ASCII subset, included as the go to default.

Don’t worry, we’ve been there and it was bad. We’re not going to do that again. We might have an ASCII module with ASCII-only versions of things like case transformation. If you only want to do a simple ASCII uppercase transform, you could then call ASCII.uppercase(s) and it would uppercase ASCII characters and leave others alone, which is a very fast, simple operation in many encodings.
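No such `ASCII` module exists in this discussion beyond the idea, so the following is only a sketch of what an ASCII-only uppercase could look like for a UTF-8 `String`. The function name `ascii_uppercase` is an assumption. It works on code units directly, which is safe because every byte of a multi-byte UTF-8 sequence is 0x80 or higher.

```julia
# Hypothetical ASCII-only uppercase: touches bytes 'a'..'z', leaves the rest alone.
function ascii_uppercase(s::String)
    bytes = collect(codeunits(s))        # copy of the underlying UTF-8 bytes
    for i in eachindex(bytes)
        b = bytes[i]
        if UInt8('a') <= b <= UInt8('z')
            bytes[i] = b - 0x20          # ASCII lowercase to uppercase offset
        end
    end
    return String(bytes)
end
```

Non-ASCII characters such as `é` pass through unchanged, and so do invalid bytes, since no decoding or validation takes place.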

Paying the penalty once, instead of every time a character is accessed, usually far outweighs the cost of doing the validation.

That seems likely, but I can think of cases when at least some strings read in, are never used again…

On the contrary, while this premise and point of view make lots of sense in @ScottPJones’s line of work – building textual databases, where data is loaded once and queried many times – this is actually quite atypical. Most of the time, a program scans through a text file, computes derived values and then throws the text away. In such use cases, being forced to do lots of work up front, only to never look at those strings again, is a total waste of time and effort.
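A small sketch of that scan-and-discard pattern, with made-up data and a hypothetical helper `sum_second_column`: the first column contains an invalid byte, but since only the second column is ever parsed, no validation is needed or performed.

```julia
# Sum the integers in the second comma-separated column of each line.
# The first column is split off but never inspected or validated.
function sum_second_column(io::IO)
    total = 0
    for line in eachline(io)
        fields = split(line, ',')
        total += parse(Int, fields[2])
    end
    return total
end

# Bytes for two lines, "a\xff,1" and "b,2"; 0xff makes the first field invalid UTF-8.
raw = UInt8[0x61, 0xff, 0x2c, 0x31, 0x0a, 0x62, 0x2c, 0x32, 0x0a]
total = sum_second_column(IOBuffer(raw))
```

If the first column did matter, the caller could opt in with `isvalid(fields[1])` at that point and pay the cost only there.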

As a general rule of thumb for high-performance computing, you don’t want to do work speculatively; you want to be as lazy as possible. @ScottPJones is trying to impose a very specific world view on the entire Julia language and ecosystem: he happens to work in an area where it does make sense to do more work pre-processing your text data so that accessing it later is faster. That’s fine and it can definitely be supported with packages, but it is not the norm. There is also an asymmetry here: if the default behavior is not to do unnecessary work, you can always opt into doing more work. But if you’ve already done the work because that’s the fundamental built-in behavior, you’re out of luck – you can’t undo work that’s already been done.
