Problems with deprecations of islower, lowercase, isupper, uppercase

I’ve been getting failures loading quite a few packages because of the move of these functions (or at least their exports) to Unicode.

For names used as frequently as these, wouldn’t it be better to have the name exported from Base, so that instead of a deprecation warning you’d get a MethodError (unless you have loaded a package that extends those functions, or done using Unicode)?

I’m not saying that they should be defined in Base, just that there should be something like:

export uppercase
"""Uppercase a string."""
function uppercase end

# or even a fallback method that tells the user what to load:
function uppercase(str::AbstractString)
    error("uppercase is not defined for $(typeof(str)); do `using Unicode` or load a package that extends it")
end

That would make it a lot easier for people writing string-handling packages (like for encoded strings, mutable strings, or faster Unicode strings :slight_smile: ), who could then simply extend the function in Base without having to add the Unicode package.
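For illustration, with a stub like the above in Base, a hypothetical ASCIIStr package (sketched here without the rest of the AbstractString interface) could do:

# If Base owned the generic function, a package could extend it directly,
# with no Unicode dependency at all:
struct ASCIIStr <: AbstractString
    data::Vector{UInt8}
end

function Base.uppercase(s::ASCIIStr)
    ASCIIStr([UInt8('a') <= b <= UInt8('z') ? b - 0x20 : b for b in s.data])
end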

The Unicode package is in the standard library, so they don’t need to add it — it will always be installed.

Why all the deprecation errors then (it’s worse than just a warning)?

julia> lowercase("ABCDEF")
ERROR: Base.lowercase has been moved to the standard library package Unicode.
Restart Julia and then run `using Unicode` to load it.
Stacktrace:
 [1] error(::Function, ::String, ::String, ::String, ::String, ::String, ::String) at ./error.jl:42
 [2] #lowercase#953(::NamedTuple{(),Tuple{}}, ::Function, ::String, ::Vararg{String,N} where N) at ./deprecated.jl:139
 [3] lowercase(::String, ::Vararg{String,N} where N) at ./deprecated.jl:139
 [4] top-level scope

Because you didn’t load the Unicode package, which is in the standard library.

Didn’t you read that?

My point was that it really shouldn’t be necessary: a lowercase function is not inherently dependent on Unicode.
You can have a lowercase function for ASCII strings, or for strings in EUC, GB, or CP-1252.
It doesn’t make sense to have to extend it from a Unicode package.
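For example, a complete lowercase mapping for ISO-8859-1 bytes is just a couple of range checks and needs nothing from the Unicode package (lowercase_latin1 is only an illustrative name):

# ASCII letters and the Latin-1 accented letters both lowercase by adding 0x20;
# 0xD7 ('×') is the one code in the 0xC0–0xDE range with no lowercase form.
function lowercase_latin1(b::UInt8)
    (UInt8('A') <= b <= UInt8('Z')) && return b + 0x20
    (0xc0 <= b <= 0xde && b != 0xd7) && return b + 0x20
    return b
end

lowercase_latin1(v::Vector{UInt8}) = map(lowercase_latin1, v)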

In general, its behavior is dependent on the Unicode version (as new codepoints get added and case-mapping tables change).

Since the function was moved, you get a deprecation error for now. In 1.0, it will just be an error, and you will have to do using Unicode.

I agree that there is a tension between the convenience of the “batteries included” approach where everything is in Base (but then can’t be updated separately from Base) and the “modular” approach where you need to import a bunch of modules to do anything nontrivial (but upgrading/compatibility/support is easier). As more things get moved into modules, I’m sure all of us will feel annoyance as something we personally use goes into a module. Hopefully a good balance can be struck.


My point was not that a Unicode version of these functions should be kept in Base, but rather that just the function names (which are not specific to any character set or encoding) should be in Base, to be extended by any AbstractString type that needs them, as well as by the Unicode standard library module.

The new state of affairs means that if you want to define a pure ASCII, ISO-8859-1, GB2312, or EUC lowercase method, you must have a using Unicode, even if you are doing nothing at all with Unicode, which does not make much sense to me.
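Concretely, with these functions owned by the Unicode module, even an ASCII-only package looks something like this (ASCIIStr is a hypothetical type, sketched without the rest of the AbstractString interface):

import Unicode   # pulled in only to get at the generic function

struct ASCIIStr <: AbstractString
    data::Vector{UInt8}
end

# The method itself needs nothing from Unicode:
function Unicode.lowercase(s::ASCIIStr)
    ASCIIStr([UInt8('A') <= b <= UInt8('Z') ? b + 0x20 : b for b in s.data])
end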


It doesn’t seem like that much of an imposition, particularly since even if you are dealing with an alternative character set like GB2312, you probably have code to deal with Unicode too (e.g. to convert to/from GB2312).

I am jumping into this conversation without much knowledge of string processing, but even so, I sympathize with @ScottPJones’s argument. If something is applicable to any kind of string, not necessarily a Unicode one, it doesn’t make sense to have it defined there.

The definition of what it means to uppercase a character is (by our definition) inherently a matter of what the Unicode standard defines it to be. That said, the uppercase and lowercase mappings are not going to change for several of these encodings, so it does seem reasonable to allow them to define what the generic functions mean without needing a false dependency on the Unicode package.


We can move these functions to Base at any time in 1.x releases if we want. But really, what’s the big deal with importing the Unicode module? Presumably, even if your string is pure ASCII or ISO-8859-1, you’ll want your implementation to be consistent with Unicode, so that’s not a misnomer either. You’d just provide a more limited/efficient subset of the general Unicode method.


GB2312, JIS, and EUC are not consistent with Unicode, for example, and I’ll be supporting those.
There are other character sets that are not even consistent with ASCII (such as EBCDIC).
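To give a rough idea (a hypothetical helper, based on the common EBCDIC code pages such as 037): the letters sit in three non-contiguous blocks, 'A' is byte 0xC1 rather than 0x41, and uppercasing a letter adds 0x40 to the byte.

# EBCDIC lowercase letters: a–i 0x81–0x89, j–r 0x91–0x99, s–z 0xA2–0xA9.
# Uppercasing adds 0x40; none of this lines up with ASCII or Unicode code points.
function uppercase_ebcdic(b::UInt8)
    is_lower = (0x81 <= b <= 0x89) || (0x91 <= b <= 0x99) || (0xa2 <= b <= 0xa9)
    return is_lower ? b + 0x40 : b
end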

In what way? Do they define their own uppercase/lowercase mappings that are incompatible with Unicode?

Yes. For example, even ANSI Latin-1 is not really fully compatible with Unicode when it comes to uppercasing: even though the characters present in ANSI Latin-1 are a pure subset of Unicode, there are two characters whose uppercase versions are not present in Latin-1.


julia> uppercase("µ")[1]
ERROR: Base.uppercase has been moved to the standard library package Unicode.
Restart Julia and then run `using Unicode` to load it.
Stacktrace:
 [1] error(::Function, ::String, ::String, ::String, ::String, ::String, ::String) at ./error.jl:42
 [2] #uppercase#954(::NamedTuple{(),Tuple{}}, ::Function, ::String, ::Vararg{String,N} where N) at ./deprecated.jl:139
 [3] uppercase(::String, ::Vararg{String,N} where N) at ./deprecated.jl:139
 [4] top-level scope

^^^ That’s what I think should be changed: uppercase (and really all of the other functions moved to Unicode) existed in C for decades before Unicode came along.
It just seems more fitting for Base to have a generic fallback that gives an error if the function has not been extended for a particular string type, saying that you need to do using Unicode to get those extensions for the Base String type.
That would give a lot of flexibility for adding optimized versions as other string types are added in packages, such as the ones in LegacyStrings.jl.

julia> '\ub5'
'µ': Unicode U+00b5 (category Ll: Letter, lowercase)

julia> '\uff'
'ÿ': Unicode U+00ff (category Ll: Letter, lowercase)

julia> Base.Unicode.uppercase("ÿ")[1]
'Ÿ': Unicode U+0178 (category Lu: Letter, uppercase)

julia> Base.Unicode.uppercase("µ")[1]
'Μ': Unicode U+039c (category Lu: Letter, uppercase)

What should the result of uppercasing a Latin-1 string containing either of these letters be?

While I have never encountered these character sets, I can imagine scenarios (e.g. data archeology) in which some library support for reading them may be useful. However, the first thing I would do in a practical situation is convert to Unicode and proceed from there.

I think that splitting Base into packages (which are nevertheless included with Julia) is great, with a lot of long-run advantages. It might be helpful for this discussion if you could sketch a use case for string manipulation with, say, an EBCDIC representation in Julia.

I think in C you would just get the lowercase character back, but in Julia it would feel more natural to throw an InexactError or similar.

For example, one could imagine that upper-casing a Latin-1 character is implemented something like uppercase(c::Latin1Character) = Latin1Character(uppercase(UnicodeCharacter(c))), and that the conversion back to Latin-1 throws an InexactError when it can’t represent the given character. (This is just a rationale for the error message; the actual implementation could be done more efficiently and not rely on Unicode.)
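A minimal sketch of that idea, with Latin1Character as a made-up type and a wasteful round trip through Char (a real implementation would use tables):

using Unicode   # where uppercase(::Char) lives in 0.7

struct Latin1Character
    b::UInt8
end

# Uppercase via the Unicode mapping, then narrow back; throw if it doesn't fit.
function latin1_uppercase(c::Latin1Character)
    u = uppercase(Char(c.b))   # Latin-1 bytes map directly to U+0000–U+00FF
    u <= '\uff' || throw(InexactError(:Latin1Character, Latin1Character, u))
    return Latin1Character(UInt8(u))
end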

Actually, for my string package, I return a UCS2Str, with those characters converted to their Unicode (BMP) uppercase versions.
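Roughly, the result widens instead of throwing; a sketch of that behavior on raw code units (illustrative only, not the actual package code):

using Unicode   # for uppercase(::Char) in 0.7

# Uppercase Latin-1 data; stay with one byte per character when possible,
# widen to 16-bit (BMP) code units when 'ÿ' -> 'Ÿ' (U+0178) or 'µ' -> 'Μ' (U+039C) appear.
function uppercase_widening(bytes::Vector{UInt8})
    chars = [uppercase(Char(b)) for b in bytes]
    return all(c -> c <= '\uff', chars) ? UInt8.(chars) : UInt16.(chars)
end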

EBCDIC is still used on IBM mainframes, such as ones running MVS and z/OS.
For example, from IBM:
https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.2.0/com.ibm.zos.v2r2.gxla100/ebcdicconsiderations.htm

Also, a lot of the data in the world is still stored using Microsoft’s hack of ISO 8859-1, CP-1252 (so much so that if you see an 8859-1 encoding tag for a web page, you are now supposed to ignore it and treat it as CP-1252, since so many pages are tagged incorrectly).
While that is compatible with ASCII, it’s not compatible with Unicode.
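The incompatibility is in the 0x80–0x9F range, where CP-1252 puts printable characters on bytes that ISO 8859-1 (and therefore Unicode’s first 256 code points) reserves for C1 controls. For example:

Char(0x80)    # U+0080, a C1 control: what ISO 8859-1 / Unicode mean by byte 0x80
'\u20ac'      # '€', the euro sign: what CP-1252 actually uses byte 0x80 for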

I am not doubting the importance of interoperability with other character sets.

What is not clear to me is why one cannot just convert to Unicode, and work with that in Julia. My understanding is that this would be the use case for islower etc.

So the behavior is actually dependent on Unicode in that sense. I think this convinces me that the functions really do belong in the Unicode package rather than the opposite.
