Problems with deprecations of islower, lowercase, isupper, uppercase

No, that was just for that specific case, as part of a Unicode string package that, like Python 3.x, stores strings containing only Latin-1 characters as a vector of bytes, saving space over using UTF-8.

If you were dealing with huge amounts of CP-1252 text, for example, it is much more efficient to use functions specialized to that character set that operate with direct indexing on the bytes.
That would have absolutely nothing to do with the Unicode character set.

That is part of what I am implementing, to make sure that Julia can be the best language for high-performance string processing.

Also, being able to optimize on validated ASCII, Latin1, or BMP-only strings can give you great performance benefits, as hopefully I’ll be able to demonstrate shortly.
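As a minimal sketch of the kind of specialization meant here (illustrative only, not the actual Strs.jl implementation): for data validated as ASCII, case mapping is a single branch on each byte, with no UTF-8 decoding and no Unicode table lookups.

```julia
# Illustrative only, not real package code: uppercase a byte vector that
# is known to hold valid ASCII, by direct arithmetic on the bytes.
function ascii_uppercase(bytes::Vector{UInt8})
    out = copy(bytes)
    @inbounds for i in eachindex(out)
        b = out[i]
        if 0x61 <= b <= 0x7a      # 'a' through 'z'
            out[i] = b - 0x20     # map to 'A' through 'Z'
        end
    end
    return out
end

String(ascii_uppercase(Vector{UInt8}("hello, world!")))  # "HELLO, WORLD!"
```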

2 Likes

I think there are two cases:

  1. You are implementing the Unicode case transform, in which case you should extend the Unicode.uppercase function (et al.).

  2. You are implementing a different non-Unicode case transform, in which case you should have a separate TextStandard.uppercase function that is separate from Unicode.uppercase.

In neither situation is Base.uppercase needed. The former may require returning a string with a different encoding (as you are doing). The latter is more likely to not require that but is fundamentally doing a different operation.
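A toy sketch of option 1 (the type and its representation are invented here for illustration; at the time of this thread the generic function lived in the Unicode stdlib, while on current Julia it is Base.uppercase):

```julia
# Toy sketch, not a real package: a single-byte string type extending the
# standard case-transform generic function for its own representation.
struct Latin1Str
    bytes::Vector{UInt8}
end

function Base.uppercase(s::Latin1Str)
    out = similar(s.bytes)
    for (i, b) in pairs(s.bytes)
        c = uppercase(Char(b))    # the standard Unicode mapping per char
        # 'ÿ' (0xff) uppercases to 'Ÿ' (U+0178), which does not fit in one
        # byte; a real package would return a wider string type instead.
        codepoint(c) <= 0xff || error("result does not fit in Latin-1")
        out[i] = UInt8(codepoint(c))
    end
    return Latin1Str(out)
end

uppercase(Latin1Str(UInt8[0x61, 0xe9])).bytes  # [0x41, 0xc9] ("aé" → "AÉ")
```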

1 Like

That doesn’t make any sense to me.
If a person has a string, of type Str{:CP1252}, and they wish to call uppercase on it, why should they have to know if it is compatible with Unicode or not?

1 Like

Possibly because, as you explained above, the behavior of uppercase depends on the encoding.

Suppose I have two strings, a and b, in different encodings, but each encoding the same sequence of code points, so that a == b. If I call a case function on them, I should get equal results:

  • uppercase(a) == uppercase(b)
  • lowercase(a) == lowercase(b)
  • titlecase(a) == titlecase(b)

What you seem to be proposing is that the result of calling uppercase on a string should depend on its encoding, which would violate that principle since you could have a == b but because of different encodings of a and b you could end up with uppercase(a) != uppercase(b).
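The principle can be checked with two representations Base already ships: a String and a SubString store the same content differently, but equal inputs give equal case-transform results.

```julia
# Same code points, different underlying representations: equality of the
# inputs implies equality of the case-transformed outputs.
a = "straße"
b = split("x straße y")[2]      # a SubString viewing part of another String
@assert typeof(a) != typeof(b)
@assert a == b
@assert uppercase(a) == uppercase(b)
@assert lowercase(a) == lowercase(b)
@assert titlecase(a) == titlecase(b)
```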

What I’m saying is that the Unicode.uppercase function should always perform the Unicode standard case mapping, regardless of how the string is encoded. If a string’s encoding does not support Unicode-compliant uppercasing within that same encoding, then the Unicode case functions should return a different encoding, presumably String. If there is a different case transformation specified along with an encoding, then that transformation should be a separate function from Unicode.uppercase because they do different things. In other words, it’s fine to have

a == b && Unicode.uppercase(a) != TextStandard.uppercase(b)

but it’s not ok to have

a == b && Unicode.uppercase(a) != Unicode.uppercase(b)

In the former case, the two functions aren't the same, so they should not be expected to compute the same thing. In the latter case, the same function should compute the same thing, regardless of how a and b are encoded.

1 Like

No, not at all.
If I have a string a of type Str{:UTF8Str} and another string b of type Str{:CP1252}, then you'd still have uppercase(a) == uppercase(b), because the == would end up comparing the Chars, which are Unicode.
How the uppercased characters are encoded in b is not important.
This is the same situation as comparing a LegacyStrings.UTF16String with a LegacyStrings.UTF8String: you are not directly comparing the underlying encoded bytes.
Same with comparing a BigFloat and a Float64, for that matter.
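Base's transcode makes this concrete: the same text has different code units in UTF-8 and UTF-16, yet equality is about the code points, which survive the round trip.

```julia
# Same text, different code units: UTF-8 vs UTF-16.
u8  = "café"
u16 = transcode(UInt16, u8)     # UTF-16 code units of the same text
@assert ncodeunits(u8) == 5     # 'é' takes two bytes in UTF-8
@assert length(u16) == 4        # but only one 16-bit unit in UTF-16
@assert String(transcode(UInt8, u16)) == u8   # code points are preserved
```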

That’s not true.
In these cases, it depends on the character set, not the encoding. If you uppercase a String, a LegacyStrings.UTF16String, or a LegacyStrings.UTF32String, they use the same Unicode uppercase tables.

In that case, the behavior of the uppercase function is inherently dependent on the Unicode package, so it makes sense that you would need to depend on it and extend its uppercase function. Consider that different versions of the Unicode package will implement different versions of the Unicode standard over time, and that by choosing which version of the package one uses, one will actually be opting into different Unicode case mappings. If some string package doesn’t depend on and extend the Unicode package, then how do you know which version of the standard to implement?

???

No, not at all.

Also, it is a fallacy to assume that for different character sets, uppercase(a) == uppercase(b).
You wouldn't expect a transformation such as the + operation to always give the same results, independent of the types, would you?

Actually, that is one of the other problems with the string support in Base (or in Unicode) that I would have liked to have fixed for Julia.
As much as possible, Julia the language should be decoupled from any particular version of Unicode.
It wouldn’t actually be that hard, and it would probably speed up loading, and probably parsing as well. I have a feeling that the way operators are parsed in femtolisp, with a huge table, is costly, and characters have to be checked against various tables and exceptions to determine if they are valid identifiers.

There would still be some coupling in strings that have $identifier (one of the many reasons I dislike the $ string interpolation; it would have been better if it had been like Swift, which always requires parentheses, i.e. "\(identifier)").

I’ve taken care of all of those issues (including the issues of @printf/@sprintf) in my StringLiterals.jl package.
The Unicode, Emoji, HTML and LaTeX tables can be rebuilt at any time without recompiling any Julia code.

That is how I’m asserting that the function should behave. For equal inputs, it should produce equal outputs, regardless of their encoding. Character sets are irrelevant, and encoding of a string is separate from its meaning just like the representation type of a numeric value is independent of its meaning. If you want to violate that principle in your code, I can’t really do anything about that, but I certainly don’t condone it.

You wouldn’t expect a transformation such as the + operation to always give the same results, independent of the types, would you?

Yes, in fact I would. If a == b then I would expect a + c == b + c, regardless of the types of a and b. Of course, we have to deal with machine arithmetic and overflow which does violate that assumption, but there’s not much we can do about that without giving up an unacceptable amount of performance, but in a perfect world, yes, a == b would universally imply a + c == b + c.
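The machine-arithmetic caveat is easy to exhibit: 2^53 is exactly representable in Float64, so the Int64 and Float64 values are ==, yet adding 1 rounds away on the Float64 side and the implication fails.

```julia
a = 2^53        # Int64
b = 2.0^53      # Float64; exactly representable, so a == b
@assert a == b
# Adding 1 breaks the implication: Float64 has a spacing of 2 at this
# magnitude, so b + 1 rounds back down to b, while a + 1 is exact.
@assert b + 1 == b
@assert a + 1 != b + 1
```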

Actually, that is one of the other problems with the string support in Base (or in Unicode) that I would have liked to have fixed for Julia. As much as possible, Julia the language should be decoupled from any particular version of Unicode.

Since #24999, #25021 and #25069, string behavior in Base Julia is independent of any specific version of Unicode. The mechanics of character iteration depend only on the fundamental structure of UTF-8, not any meaning or tables defined in the Unicode standard. Behaviors that are specific to a particular version of Unicode are isolated in the Unicode standard package.
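Concretely: the byte structure of UTF-8 alone determines character boundaries, and even invalid bytes iterate as (malformed) Chars without consulting any Unicode tables.

```julia
s = "a\xffΣ"               # contains an invalid 0xff byte
@assert ncodeunits(s) == 4  # 1 byte + 1 byte + 2 bytes
@assert length(s) == 3      # the invalid byte iterates as one malformed Char
@assert !all(isvalid, s)    # validity is a property of the data; iteration
                            # works regardless
```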

It wouldn’t actually be that hard, and it would probably speed up loading, and probably parsing as well. I have a feeling that the way operators are parsed in femtolisp, with a huge table, is costly, and characters have to be checked against various tables and exceptions to determine if they are valid identifiers.

You seem to be confusing the parsing of Julia code itself with the behavior of Julia programs.

There would still be some coupling in strings that have $identifier (one of the many reasons I dislike the $ string interpolation; it would have been better if it had been like Swift, which always requires parentheses, i.e. "\(identifier)").

Since we’ve made "\(" a syntax error already, we could switch to that style of interpolation, which I agree is better since it’s not a valid syntax for anything else. It would just be a lot of code churn to change the string interpolation syntax now. I suppose that with femtocleaner support, it could be done, however.

4 Likes

There is a big problem: Unicode casing is ambiguous because it is locale-sensitive.

So in Unicode itself, equal input doesn't guarantee equal output.

These ambiguities are unavoidable in Unicode but do not arise in CP-1252.
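The classic example is Turkish casing: in a Turkish locale, 'i' uppercases to dotted 'İ' (U+0130), while the locale-independent default mapping used by Julia gives plain 'I'.

```julia
# Julia's built-in case mapping is the locale-independent Unicode default:
@assert uppercase('i') == 'I'
@assert lowercase('I') == 'i'
# A Turkish-locale-aware transform would instead map 'i' to 'İ' (U+0130)
# and 'I' to dotless 'ı' (U+0131); Base provides no way to request that.
turkish_capital_i = '\u0130'   # 'İ'
@assert turkish_capital_i != uppercase('i')
```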

That’s a separate issue. As long as both transformations are done in the same locale, they should produce the same result. We do at some point need to provide a way to specify locales, however.

It’s really annoying how many packages are broken because of this change, besides the fact that it doesn’t seem correct at all, since the functions really are generic and not Unicode-specific.
You don’t have to write DecFP.+(dec128"123.55", Dec128(1)) to add decimal FP numbers, do you?

WARNING: importing deprecated binding Base.lowercase into DecFP.
ERROR: LoadError: Base.lowercase has been moved to the standard library package Unicode.
Restart Julia and then run using Unicode to load it.
Stacktrace:
[1] error(::Function, ::String, ::String, ::String, ::String, ::String, ::String) at ./error.jl:42
[2] #lowercase#955(::NamedTuple{(),Tuple{}}, ::Function, ::String, ::Vararg{String,N} where N) at ./deprecated.jl:139
[3] lowercase(::String, ::Vararg{String,N} where N) at ./deprecated.jl:139
[4] top-level scope at /Users/scott/.julia/v0.7/DecFP/src/DecFP.jl:173
[5] include at ./boot.jl:283 [inlined]
[6] include_relative(::Module, ::String) at ./loading.jl:503
[7] _require(::Symbol) at ./loading.jl:434
[8] require(::Symbol) at ./loading.jl:298
in expression starting at /Users/scott/.julia/v0.7/DecFP/src/DecFP.jl:173

1 Like

Do I understand correctly that you are asserting that Julia should not guarantee that encoded strings are interchangeable between two users with different locales?

How is Base.lowercase supposed to know that it is not processing a String? How could you inject another (encoded) string type here?

Why do you think character sets are irrelevant? That’s the main thing (besides the further consideration of locales) that is of importance for this sort of transformation.

Please! There is no perfect world, neither for numeric representations (decimal vs. binary floating point, amount of precision, range of exponents) nor for string representations (different character sets, some of which, such as Unicode, have multiple encodings).
Just ignoring these issues, and the best practices for dealing with them, is not very useful.

You seem to be forgetting that people use the $identifier syntax in strings all over Julia programs, as well as frequently using parse (now Meta.parse, of course) and doing code generation.
That means that the behavior of Julia programs is not independent of the Unicode version.
Deprecating the $identifier syntax in v0.7 would help a lot, though, to decouple Julia's behavior from dependence on a specific Unicode version's character category tables.
Making things like character categories, identifier tables, operator tables, LaTeX and Emoji tables (and even operator precedence) loadable from compact tables would allow Julia to be totally decoupled from the vagaries of Unicode, LaTeX, and Emoji table updates.
Note that I’ve implemented support for a lot of that already in my JuliaString/StringLiterals.jl package (and I put in support for “legacy” escape sequences such as $ for interpolation, \U001xxxxx, and \uxxxx in a compatibility string macro as well), although one of the big benefits of moving to the \(...) syntax like Swift is not having to quote $ for LaTeX sequences, US (and other places’) money, etc.

lowercase and uppercase should (until some version of the functions allows for locale information, possibly via a keyword argument) use the standard Unicode tables for the base String type.
However, for something like my Strs.jl package, they would use different tables based on the character set, if it’s not Unicode.

This discussion has been helpful in clarifying one thing for me, so I’ve now split my Latin1 string type into two: one is a standard 8-bit character set, and the other is actually just an optimization of Unicode for storage. The only difference is in one function, uppercase, where LatinStr always returns a LatinStr, but calling uppercase on a UniStr that is internally stored as a LatinUStr may return a UniStr that is internally stored as a UCS2Str.

This is essentially the same as what Python 3.x does, which I think most people would agree does a good job of string handling (however, I think I’ll be able to beat Python’s speed with the string package I’m writing, taking advantage of some of the new features recently introduced into master [fast Union dispatching]).
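A toy sketch of the split described above (invented type and function names, not the actual Strs.jl code): a compactly stored Unicode string widens its storage when an uppercased character no longer fits in one byte.

```julia
struct LatinUStr           # Unicode text that happens to fit in Latin-1
    bytes::Vector{UInt8}
end
struct UCS2Str             # Unicode text needing 16-bit code units
    units::Vector{UInt16}
end

# Uppercase a compactly stored Unicode string, widening if necessary.
function unistr_uppercase(s::LatinUStr)
    up = [codepoint(uppercase(Char(b))) for b in s.bytes]
    if all(cp -> cp <= 0xff, up)
        return LatinUStr(UInt8.(up))   # result still fits: stay compact
    else
        return UCS2Str(UInt16.(up))    # e.g. 'ÿ' → 'Ÿ' (U+0178): widen
    end
end
```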

1 Like

Why do you think character sets are irrelevant? That’s the main thing (besides the further consideration of locales) that is of importance for this sort of transformation.

The character set has to do with how a string is encoded, not what is encoded. The whole premise of abstraction is that those are separate. You’re insisting that how a string value is encoded should affect how functions like uppercase operate on it. That completely violates any kind of abstraction over string encodings. As I said, you’re free to do that, but it’s a bad idea and will make your packages treacherous to use generically.

Please! There is no perfect world, neither for numeric representations (decimal vs. binary floating point, amount of precision, range of exponents) nor for string representations (different character sets, some of which, such as Unicode, have multiple encodings).
Just ignoring these issues, and the best practices for dealing with them, is not very useful.

In the case of strings, there is no reason for violating the abstraction. There is no performance cost to having separate functions that implement Unicode case transformations versus whatever encoding-specific case transformations may exist.

You seem to be forgetting that people use the $identifier syntax in strings all over Julia programs, as well as frequently using parse (now Meta.parse, of course) and doing code generation. That means that the behavior of Julia programs is not independent of the Unicode version. Deprecating the $identifier syntax in v0.7 would help a lot, though, to decouple Julia's behavior from dependence on a specific Unicode version's character category tables.

I’m not forgetting that, it’s just a completely orthogonal concern from how Julia programs process strings, so I’m not sure why you keep bringing it into this conversation.

Making things like character categories, identifier tables, operator tables, LaTeX and Emoji tables (and even operator precedence) loadable from compact tables would allow Julia to be totally decoupled from the vagaries of Unicode, LaTeX, and Emoji table updates.

Decoupling and independence are not the same thing. I really would not want the meaning of my Julia program to depend on some data that’s not even fully specified by the Julia version I’m using – that’s a complete reproducibility nightmare.

Note that I’ve implemented support for a lot of that already in my JuliaString/StringLiterals.jl package (and I put in support for “legacy” escape sequences such as $ for interpolation, \U001xxxxx, and \uxxxx in a compatibility string macro as well), although one of the big benefits of moving to the \(...) syntax like Swift is not having to quote $ for LaTeX sequences, US (and other places’) money, etc.

Yes, as I’ve said, it’s a nice syntax. My main reservation is that using it would introduce a very large amount of code churn in the ecosystem.

Please go do some reading on the differences between character sets and encodings.
Until you do so, it doesn’t seem to be useful to continue trying to discuss this with you.