Problems with deprecations of islower, lowercase, isupper, uppercase

Sorry, I meant “has to do with” rather than “is”. Clearly this slip means that I know absolutely nothing about strings or encodings. I do think this conversation has become somewhat less than useful.

Conclusion: we’re going to keep Unicode.uppercase et al. where they are and all officially sanctioned packages adding methods to them will implement Unicode transformations. It’s also perfectly fine to have separate uppercase functions, say TextStandard.uppercase, which implement a different case mapping, but they should not be the same case mapping functions as the Unicode package exports.

As a side note, if \() syntax has advantages over $ interpolation, please implement it and let FemtoCleaner clean up old code :+1: There is no reason to stop advancing the language when there is a robot to fix software.

Regarding the conclusion stated above, it seems that both sides have strong opinions on the subject. It would be great to bring more people into the discussion before a decision is made. A healthy way forward is to open an issue on GitHub and invite more people involved with string processing to take a look.

Putting myself in the position of @ScottPJones, I wouldn’t appreciate a decision taken without careful discussion with all the other members, especially if it is about a subject that I have expertise in.

Despite the fact that only @StefanKarpinski replied here, several people have commented on the pull request, including Steven, who generally maintains the string code. Please don’t assume we did things the wrong way without checking first.

Also, note that one of the reasons this change was made a few days before feature freeze is that moving things out of Base allows changing the API later if needed, including moving functions back to Base if we change our minds. Conversely, if we keep them in Base we won’t be able to change them again in the 1.x series.

6 Likes

I am really trying to understand this, but sorry, what I see is just a declaration and not an explanation.

I could try (with my poor English, so sorry, it won’t be optimal) to explain why I think that AbstractString and encodings have to allow separation from Unicode.

I see more reasons, but we could focus on performance here. If I have CP-1250, for example, then all chars are bytes, so no O(N) algorithm is needed to find the n-th character. Uppercase could be implemented as a simple translation table, i.e. a function from UInt8 to UInt8 (or simply a Vector{UInt8}(256) where the index is the input byte and the value is the output byte). No locale dependencies have to be checked. I see no reason to slow this function down with Unicode complications.

I really suppose that I could be missing something, but what is the reason to replace this simplicity with ambiguous and slow Unicode functions?

IMHO, if there is a need for a Unicode-“nonviolent” design, then why not have an AbstractUnicodeString in the string type hierarchy?

This concern is orthogonal to the present discussion. You can perfectly well (and should) implement custom methods to make these operations faster for encodings like CP-1250. What we argue here is that custom string types which support a subset of Unicode should implement efficient versions of Unicode.lowercase, and that will be transparent to the user. But custom string types which implement a different behavior should implement a different function (called TextStandard.lowercase in @StefanKarpinski’s example), since it does not follow the same definition. This is consistent with what happens with any type which implements a generic interface: it has to follow the same rules, or generic programming is broken.

4 Likes

I am really trying to understand this, but sorry, what I see is just a declaration and not an explanation.

Yes, that was just my summary of the conclusion. The explanation is above in this thread.

I could try (with my poor English, so sorry, it won’t be optimal) to explain why I think that AbstractString and encodings have to allow separation from Unicode.

There is nothing about the string interface that dictates Unicode except that AbstractString is an encoding of a sequence of code points – and the meanings of code points are given by Unicode. This is not really a limitation, however, since Unicode is a superset of all the other character sets you might care to use. You can implement string types for any encoding you want to.

I see more reasons, but we could focus on performance here. If I have CP-1250, for example, then all chars are bytes, so no O(N) algorithm is needed to find the n-th character. Uppercase could be implemented as a simple translation table, i.e. a function from UInt8 to UInt8 (or simply a Vector{UInt8}(256) where the index is the input byte and the value is the output byte). No locale dependencies have to be checked. I see no reason to slow this function down with Unicode complications.

You can absolutely implement that transformation. But then you are – by definition – not implementing the Unicode.uppercase function. You can call the function that implements this CP1250.uppercase or something like that. If you know that you have CP-1250 strings and you really want to have very fast case transformations and are ok with the non-standard case mapping it implements, then you can use this function. But that’s a lot of "if"s. When someone else writes generic string code and asks for Unicode.uppercase, it would not be ok if the result they got did not do the expected standard Unicode case transformation. That is why the Unicode.uppercase and CP1250.uppercase functions should be separate.
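To make the table idea concrete, here is a minimal sketch of what such a CP1250.uppercase could look like. The CP1250 module and its Str wrapper type are hypothetical (not a real package), and only the ASCII rows of the table are filled in; a real CP-1250 mapping would also cover the accented letters in the 0x80–0xFF range.

```julia
# Hypothetical CP1250 module with a thin byte-string wrapper (illustration only).
module CP1250

struct Str
    bytes::Vector{UInt8}
end

# 256-entry translation table: identity, except ASCII a-z map to A-Z.
const UPPER = UInt8[(UInt8('a') <= b <= UInt8('z')) ? b - 0x20 : b for b in 0x00:0xff]

# One table lookup per byte: no locale checks, no Unicode tables.
uppercase(s::Str) = Str([UPPER[Int(b) + 1] for b in s.bytes])

end # module

s = CP1250.Str(Vector{UInt8}("hello"))
String(CP1250.uppercase(s).bytes)  # "HELLO"
```

The point is simply that such a function is cheap and well defined for this encoding, but it is not the Unicode case mapping, so it deserves its own name.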

If you want a function that does some kind of uppercasing and you’re not too fussy about exactly how, so long as the result looks reasonably uppercaseish, then you can define something like this in your code:

uppercase(s::CP1250.String) = CP1250.uppercase(s)
uppercase(s::AbstractString) = Unicode.uppercase(s)

Now you have an uppercase function that is fast for CP-1250 and has a reasonable fallback for other kinds of strings. You can add as many methods to this function as you want to for specialized ways of uppercasing particular string types.

I really suppose that I could something miss but what is reason to change this simplicity with ambiguous and slow Unicode functions?

The Unicode functions are the opposite of ambiguous – they are carefully specified and standardized. They’re also not especially slow, and I have yet to see a situation where string case transformations are performance-critical. But if you find yourself in such a case, you can always use the CP1250.uppercase function.

IMHO if there is need to have Unicode “nonviolent” design then why not to have AbstractUnicodeString in string type hierarchy?

There is no need for this since AbstractString already supports all kinds of encodings.

5 Likes

I wasn’t able to comment there, as I am not permitted to.
Just because the few people who were able to comment on GitHub didn’t raise any objections doesn’t mean it wasn’t done the wrong way.

3 Likes

Thanks, that makes sense. And because of Julia’s flexibility, I expect there is still the possibility to make a completely different AlternativeString.

What I am afraid of is something like dec128"123.55". I see no flexibility here. Could we think (for 1.x) about something similar to Python’s b"abc"?

For example, could dec128"123.55"ᵇ be parsed as a call to dec128_str(s::Vector{UInt8})?

But I am not sure if it would satisfy Scott. @ScottPJones, could it help you?

I don’t really get what the problem is. We already have b"..." – it constructs a Vector{UInt8} using string syntax (with UTF-8 encoding, allowing invalid data with hex escapes). After recent string changes, b"..." has the exact same effect as Vector{UInt8}("..."), it’s just shorter and can avoid constructing an intermediate string object. What issue are you trying to solve here?
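The equivalence described above can be checked directly. (A sketch; on recent Julia versions b"..." may return a read-only CodeUnits view rather than a fresh Vector{UInt8}, but the bytes compare equal either way.)

```julia
v = b"abc"                 # byte-array literal
v == Vector{UInt8}("abc")  # true: same bytes either way
b"\xff" == UInt8[0xff]     # true: hex escapes allow invalid UTF-8 data
```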

4 Likes

I’m sympathetic, but it’s a lot of work, both to make the change in Base and to get FemtoCleaner to automatically upgrade code. The best way forward would probably be to get the code update working in FemtoCleaner as a proof of concept, and then make the change to Julia’s parser and use FemtoCleaner on Julia itself to upgrade all the uses of string interpolation there. It’s doable, but I do not have the bandwidth to take this on.

1 Like

What does dec128 have to do with this?

This specific issue:

It is annoying to need to depend on Unicode in cases where you know you’re only going to have ASCII data. I had mentioned elsewhere (Slack, I think), that it might make sense to have an ASCII module exported from Base that defines ASCII.uppercase, etc. which do ASCII case transformations leaving all code points outside of the ASCII range alone, since that’s a fairly common need and is easy to implement very efficiently for both String and Char.
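A minimal sketch of that hypothetical ASCII module (the name and its placement in Base are just the proposal above, not an existing API): it uppercases only the ASCII range and leaves every other code point untouched.

```julia
module ASCII

# Uppercase only ASCII a-z; leave every other code point alone.
uppercase(c::Char) = 'a' <= c <= 'z' ? c - 32 : c
uppercase(s::AbstractString) = map(uppercase, s)

end # module

ASCII.uppercase("résumé abc")  # "RéSUMé ABC"
```

Because it never consults the Unicode tables, this is easy to make very fast for both String and Char.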

I also realized that we consistently use $ for interpolation not just into strings, but also for command objects and expressions – i.e. of values into expressions quoted with :( ... ) and quote ... end. For command interpolation, using $ matches the shell, which is quite nice, and for expression interpolation \( ) wouldn’t work since it’s valid Julia syntax. So if we changed string interpolation then it would be the odd one out, which makes it considerably less appealing. On the whole, having to escape $ in strings doesn’t seem so bad, and we can always have custom string literals that implement \( ) interpolation instead of $ interpolation (which is exactly what @ScottPJones’s package does).

2 Likes

One way forward might be to use the “legacy mode” version that I did, which still allows the old escape sequences (which Swift, for good reason, did away with, along with $ for interpolation) in v0.7,
and maybe clean out uses of the legacy $identifier sequences in Base and the stdlib as well.
Making FemtoCleaner automatically upgrade all of the uses could be done later, possibly along with a compiler flag that makes uses of the legacy escapes and interpolation give a deprecation warning.

I still think that the following points definitely trump a minor inconsistency with interpolation into non-string objects (which isn’t nearly as frequent as string interpolation, from what I’ve seen):

1. eliminating issues with having to quote $ all over the place for things like LaTeX sequences,
2. not having to remember to quote $ when dealing with applications displaying monetary values (not just in the US; many other countries use the $ sign),
3. inconsistency with every other language that uses C-like string escapes,
4. avoiding coupling the parser to the Unicode character tables (to figure out where an identifier ends).

3 Likes

I’ve opened a GitHub issue about this potential change:

https://github.com/JuliaLang/julia/issues/25178

3 Likes

Thank you!

2 Likes

In the issue you opened, it doesn’t mention one of the arguments I touched on here that I think is rather important: the inconsistency of $ needing to be quoted inside a string compared to most major languages with C-like string literals (Java, C, C++, Python, C#, JavaScript, R, Swift, Objective-C, Ruby, Go, Lua, Scala, Erlang, D, Rust, Clojure, etc.), or some of the others in the top 20 (TIOBE ranking), such as VB .NET & VB, Object Pascal, PL/SQL, and Fortran, which have more primitive string literals without C-like escape sequences.
Assembly languages aren’t really comparable, and Scratch is visual; it doesn’t even have string literals!
Only a few treat $ specially: in the top 20, only PHP and Perl (together only about 2.8% of TIOBE’s total ratings), and quite frankly, I don’t think many developers will be coming to Julia from PHP or Perl, Julia not being at all in the same niche as either of them.

I bring this up because of a comment from @nalimilan on GitHub, https://github.com/JuliaLang/julia/issues/25178#issuecomment-352700938.
Since Julia has chosen to use a C-like syntax for its string literals, I think that being more consistent with the vast majority of languages that also use C-like string literal syntax is more important than a minor inconsistency between expression interpolation and string interpolation. (BTW, I think using $ for expression interpolation is quite fine. In the past couple of years of us (Dynactionize.com) using \(...) string interpolation alongside $ expression interpolation in Julia, via my package, nobody has had a problem with the two being different,
though they had had problems with remembering to always escape $ as \$ in string literals.)

Responding to @jeff.bezanson’s comment on GitHub:

https://github.com/JuliaLang/julia/issues/25178#issuecomment-352807127

Is two extra characters really that big of a problem?
If you want to actually use $ in a string (think of how common it is in LaTeX strings, for example), then
you end up using more characters again, and what about the consistency issue with nearly every language that has C-like string literal syntax?
Before I added the Swift syntax for string interpolation, that was frequently an issue with new developers learning Julia (at work).

Also, I think it would be rather hard to explain to people just why it sometimes works and sometimes doesn’t,
based on some huge table of blacklisted characters.

Another point is that what looks like a “weird character” to one person might seem perfectly natural to somebody writing in Malayalam, or Korean, or whatever.

On the other hand, there’s plenty of precedent for $ interpolation in Perl and shells. The string literal syntax can be seen as Perl-like rather than C-like.

Yes, but, is that where most Julia developers are coming from?

Also: https://www.fastcompany.com/3026446/the-fall-of-perl-the-webs-most-promising-language