I agree completely about there being way too much stuff in Base (why are BigInt, BigFloat, LinAlg, and all the REPL, Markdown, and documentation support stuff in Base, for example? That will need to be discussed in another topic).
However, Stefan has claimed, a number of times, incorrectly, that what I would like to do is:
That is exactly what he has been doing, long before I ever learned about Julia.
He makes statements about my line of work, when he has no idea all of the things I’ve done programming.
If he wishes to talk about his own experiences that have informed his opinions on what is best for high performance, then he should do so, and not try to (incorrectly) discuss other people’s.
As the performance “guru” for a language/database system known for its string handling, and used for all sorts of purposes, I had to make sure that all sorts of use cases were handled as efficiently as possible, not just the text processing & NLP ones I’d mentioned. I think I’ve brought those cases up before because, in fact, one of the things that has excited me all along about Julia is that the combination of the great linear algebra and statistics stuff with performant string handling (which I knew, given the type system, could be done, something I’ve been saying since 2015) would make it great for those sorts of NLP tasks, and for dealing with the massive amounts of unstructured data out there (usually in databases and files, not web pages!).
@ChrisRackauckas - you seem to think that I want to put a whole lot into Base, that’s not true.
I want to have a good basic architecture in Base, one that I believe would shrink the size of Base, that is more performant than the current code, and that is easier to understand and use for those scientists who really don’t want to have to deal with complicated things like trying to directly process UTF-8 encoded strings.
Anyway - instead of trying to shut down dissenting viewpoints, why not just wait, take an evidence-based approach (something that unfortunately seems to have gone missing on the world stage over the last few years!), and let the code and architecture speak for themselves?
I would really like to see all of the benchmarks Stefan has showing why his changes to String and Char are objectively better. So far, all I’ve seen are three simple functions: one doesn’t even test the effects of the changes to Char; another only iterates over the characters but never uses the value (and, thanks to the wonders of Julia, the parts that have so greatly affected the performance of my company’s code were likely optimized out); only one actually used the value. All three tests were done on a single 100% ASCII dictionary file with short words, not even sentences (i.e. the very best case for the current code).
I am currently benchmarking 19 different functions, operating on a variety of text files (including the ASCII dictionary Stefan used!).
I’m very confident that I will be able to prove my points conclusively in the very near future.
It is unfortunate that you construe a technical discussion as something impinging on your professional reputation as a “guru”.
It is apparent from this thread that you are indeed pushing for a specific way of handling strings, which deals with issues you consider important. There is nothing wrong with this: it is clear that you need these features, and I can imagine the possibility of others finding them useful, too.
OTOH, there is nothing wrong with the Julia core team saying that they want to make trade-offs different from your preferences. This happens all the time in open source, and per se does not diminish one’s reputation in any way.
(also, a thread with 140 posts, many of which are yours, is perhaps not the best place to claim that anyone is “trying to shut down dissenting viewpoints” )
The specific part that I felt was damaging to my reputation was the comment by Stefan, which is totally false:
I believe he needs to retract that statement, and own up to the responsibility for the architecture of strings in v0.3.x and v0.4.x, his sweeping changes in v0.5.x/v0.6.x, and the even further changes in master.
Does permanently banning him from discussing and contributing to the JuliaLang GitHub repos count as “trying to shut down dissenting viewpoints” to you?
I really want to empathize with both sides, but filtering the discussion, Scott has made fair points:
Creating shims of these functions in Base, just for extension.
Inconsistent abstractness in the type tree: we need an Abstract* type (i.e. an AbstractChar) for Julia’s magic to work with all types, not just some.
Unicode.uppercase, in some other package ASCII.uppercase, *.uppercase, etc. …really? It doesn’t make sense; what happened to multiple dispatch in this case?
Some other things I may have missed?
I don’t buy that using Unicode is fine when there is no Unicode-related stuff involved.
I have asked to become a moderator; this is something that has to be handled organically by the community, not hermetically by the same people over and over again across all of the Julia infrastructure platforms.
This is hurting the community, I’m sad and a bit angry honestly, please stop doing this. Mods, please step down and let others do this.
Being an admin is not the same as being a mod; please understand this. We need a few trusted admins for security purposes, but we need many trusted mods, especially as time goes by and the forum grows.
I’m seriously planning on abandoning ship, like others have, for this very same reason. But before that, I want to try to help improve this situation.
I didn’t know that the Julia community standards also had “double standards”.
The problems with the new Char type: performance, complexity, and breaking code that could previously rely on Vector{Char} being binary-compatible with vectors of characters (C/C++ wchar_t, depending on the system, and, in C/C++ 2011 and later, portably with char32_t).
The stated reason for these very disruptive changes is to allow reading, and passing through, strings where the character set and/or encoding is not known.
I believe that the architecture that I am currently implementing makes dealing with that much easier for the Julia programmer, more performant, and much more robust, but I understand that until I’ve published the code, many people may doubt my claims.
There is an on-topic action being taken: @ScottPJones is implementing a string package where he can demonstrate string improvements. This is a win for everyone, as the worst-case scenario is a great string package outside base.
The off-topic issues appear to be a mixture of things, but here is an essential read for those that haven’t seen it: On Open Source.
I’d also like to encourage those interested to first contact the Julia Community Stewards (Julia Community - Stewards) to see if matters can be resolved in the way the community has set up.
This is more than a plea to stop: as a representative of the Julia Stewards, I’m asking everyone to tone it down, and @ScottPJones to specifically apologize for his statements here and in the thread on “postpone 1.0” that insinuate motive based on employment. That is a clear violation of our standards, as it constitutes a veiled threat (“I’ll report you to your granting agencies!”) as a means of trying to exert leverage over others.
Second, matters regarding “bans” and other interactions involving the Julia Stewards are confidential matters. Please do not discuss them in forums. Since @ScottPJones himself brought the issue up in a way that casts himself as a victim, I will clarify one point: if you read the procedures that govern actions of the Stewards, you’ll see that lifting a ban requires that the recipient submit an application to have that ban lifted. @ScottPJones has never submitted such an application, and as a consequence his ban remains in force. But the ball is, and has been, in his court.
I recognize that many of the people contributing on all sides of this discussion want Julia to succeed, but I ask people to spend a little more time asking themselves whether these discussions contribute to that goal.
Please! I made a plea for people who are seen as representing the community because of their connection to Julia Computing to try to hold themselves to a higher standard, for the good of the entire community.
How is that a violation of our standards?
You saying:
I take as an extreme violation of the standards, impugning my motives.
From those very standards (Julia Community - Stewards):
“Where possible, good intentions of the participants should be assumed.”
For the record, I have no intention of reporting anybody to the granting agencies.
That does not mean that I will be quiet here when people (even stewards) violate those standards publicly.
The ban itself was very public. You can’t sweep things about it under the rug just because much later, it was discussed in another context, which in fact led to the very creation of the Julia Stewards, and the process described, which I was not allowed to avail myself of.
I have not discussed anything confidential about what happened.
If you felt that I did so, you could have followed the recommendations in the standards, and contacted me privately, and I could have removed anything you felt was breaching that confidentiality.
Part of the problem is that none of those procedures were available when I was banned back on Dec. 15th, 2015. I have never been given the opportunity to respond to the claims and allegations made.
That process says that the application must have:
acknowledging his/her past violations
Without a chance to reply to those allegations of violations, it’s hard to plead guilty to something when one believes in one’s own innocence. I know I can be a pain in the butt, and especially in my very first interactions on GitHub I had to learn how the community worked, but characterizing my comments back then as extreme violations of the standards, deserving of a ban, in the face of clear violations by people who have since been made stewards, is very difficult to swallow.
Aside: Unless an individual states that they are speaking for their employer, it’s just best avoided in forum discussions. We don’t want people to feel obliged to start adding standard disclosures and other legalese to their sigs (“these opinions are my own and don’t represent those of…”, etc.). The problem with this statement is partly that it can eventually be used to cut both ways (e.g. if their connection to JC means they should behave better than the average forum member, do any regrettable comments reflect poorly on their employer too?). Tim’s response was (I hope and assume) hyperbolic in nature, but reflects what may be the concerns of many companies as they decide how they can support open source.
I was actually talking about not being able to get the same performance as String, because String has special hooks into the C code and avoids having an extra object and a pointer indirection on every access, which means that it is not really apples to apples to compare the performance of strings in my package against ones with special help.
This is a relatively new optimization in Base. When Jeff implemented it, there was also talk about implementing instead the generalized version of it. In time that too will be implemented, but for pragmatic reasons, that likely won’t be finished until v1.1. In the meantime, alternate string types can continue to use the older representation (a struct containing a Vector{UInt8}), or lightly shim it to use a struct containing a single Core.String field (treating it as a just a simple immutable byte buffer). The compiler should hoist the pointer references in the outer struct wrapper, resulting in optimal performance access.
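As a rough illustration of the “lightly shim it” approach described here (all names are hypothetical, and the codeunit-based interface shown is the one from recent Julia master):

```julia
# Sketch of a custom string type using a Core.String purely as an
# immutable byte buffer. `ShimStr` is an illustrative name, not a real
# package type.
struct ShimStr <: AbstractString
    data::String              # Base's fast, GC-friendly byte layout
end

Base.ncodeunits(s::ShimStr)           = ncodeunits(s.data)
Base.codeunit(s::ShimStr)             = UInt8
Base.codeunit(s::ShimStr, i::Integer) = codeunit(s.data, i)
Base.isvalid(s::ShimStr, i::Integer)  = isvalid(s.data, i)
Base.iterate(s::ShimStr, i::Int = 1)  = iterate(s.data, i)
```

Because every operation forwards to the inner String, the compiler should be able to hoist the pointer load out of hot loops, as described above.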
There seem to be several unrelated considerations in this discussion (which is driving Stefan completely nuts): the String representation, the meaning of Char, and the representation of Char.
One of the recent changes replaced Char(utf32) with Char(utf8). There’s nothing special about Char internally to Julia, so packages can easily define their own primitive type Char 32 end without loss of performance relative to the builtin representation (assuming equally optimized routines written for each) for use with their corresponding String type. The use of Char(utf8) in Base has a few advantages given that Julia’s String type is assumed to represent UTF-8 data but might be any random bytes (notably, being able to define isvalid as map(isvalid), and that String(collect(s)) === s; prior to his changes, those might either throw errors or replace invalid bytes). If you aren’t using the Base String type, there’s no particular reason you should need to encode your data using the Base Char type either. The Int32 representation of a Unicode scalar is typically just as reasonable a representation for a character extracted from a UTF-16 or UTF-32 encoded string (just as before the recent string change in Base, when we were using Vector{UInt16} to represent the msvcrt wchar_t* type or Vector{Int32} to represent the unix wchar_t* type – e.g. Vector{Cwchar_t} in both cases).
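For instance, a package-local 32-bit character type along those lines might be sketched as follows (UCS4Char is an illustrative name, and the conversions shown are assumptions about what such a package would need):

```julia
# "There's nothing special about Char": a 32-bit primitive type holding
# a raw Unicode code point (UTF-32 style), defined entirely in a package.
primitive type UCS4Char 32 end

UCS4Char(c::Char)        = reinterpret(UCS4Char, UInt32(c))
Base.UInt32(c::UCS4Char) = reinterpret(UInt32, c)
Base.Char(c::UCS4Char)   = Char(UInt32(c))
Base.:(==)(a::UCS4Char, b::UCS4Char) = UInt32(a) == UInt32(b)
```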
The second change was to try to move Unicode-specific functions into their own namespace. This helps separate the concerns of how data is represented (such as UTF-8 vs UTF-32 vs ASCII) from how it is transformed (such as deciding whether a codepoint is numeric). Moving this out of Base helps clarify the possibility of allowing someone to specify alternate transforms on the same types, such as Unicode3_2.uppercase, Libc.isspace, and perhaps even ASCII.isnumber.
The lowercase function is not really dependent on Unicode;
you can have a lowercase function for ASCII strings, or for strings in EUC, GB, or CP-1252.
It doesn’t make sense to be extending it from a Unicode package.
This has already been hashed out at length above, but I think some of the length of this thread (and Stefan’s frustration) comes from this initial misunderstanding. The Unicode standard tries to be the maximal set of all other encodings (in most cases – as you pointed out above, there are exceptions for some Chinese encodings, for example), so you can compute the lowercase character in the Unicode code-space and then see if that’s a valid character in the encoding code-space (and choose to error / replace / not change instead). However, it is also meaningful (and in some cases, necessary) to ask a different question of the same data. For example, to support DOS environment variables, we may need to call ASCII.stricmp. Or for macOS, we may want to examine the equivalence of names in the filesystem (Unicode 3.2 Form D for HFS+ and Unicode 9.0 for APFS). Observe that all of these functions work on Unicode characters (and thus can be defined to work on any string type), but can return different answers (and thus are different functions).
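The filesystem-equivalence example can be made concrete with the stdlib Unicode module (note that Unicode.normalize follows the current Unicode version, not the exact Unicode 3.2 tables HFS+ uses, so this is only illustrative):

```julia
using Unicode

# Same visible name, two byte sequences: precomposed é (U+00E9)
# versus e + combining acute accent (U+0301).
a = "caf\u00e9"
b = "cafe\u0301"

a == b                                                    # false: different code units
Unicode.normalize(a, :NFD) == Unicode.normalize(b, :NFD)  # true: equivalent after NFD
```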
I do think that people give more weight to comments from employees of JC, and it would be good if the HR office of JC gently reminded their employees of that.
I very deeply apologize if any of that was misinterpreted as a threat of any sort - I don’t believe that I have any such power to threaten anybody at JC, and never thought that anybody reading what I said could have interpreted it that way.
I have just been concerned that comments such as the ones made on the other thread, which got even worse when pointed out, could have negative consequences that could affect all of us. That’s not a threat; I genuinely want the Julia project, and the commercial enterprise Julia Computing, to have enormous success, as my own success depends on it. (Some of the recent news about grants and funding has been very helpful to us in trying to convince customers that we are not crazy for using such a new, not-even-v1.0 language, and getting to v1.0 will help even more [as long as v1.0 doesn’t gain a bad rep when it gets out the door].)
Yes - it’s not nearly as much extra overhead as I’d feared, but for now, it’s still there. I am indeed allocating memory for all Str types using Base.StringVector, which works well, except for that extra indirection, which hopefully will go away as you say, by v1.1.
The compiler improvements in master, BTW, have been wonderful, and for somebody as obsessed with performance as I am, have been an absolute joy to behold!
Unfortunately, in the absence of an AbstractChar type used consistently throughout Base and packages, that’s not really true, unless you want to (as I am having to do) write new methods using AbstractChar for all of the functions in Base that expect to take or produce a Char (which my package is doing, although hopefully the code in Base can be rewritten to use AbstractChar before v1.0 is released, and then I’d only have to worry about writing optimized methods for my character types that are <: AbstractChar).
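A minimal sketch of what that pattern could look like (the abstract type and names here are hypothetical stand-ins, since Base does not yet have AbstractChar):

```julia
# Package-local abstract character type, with one generic method
# covering all concrete character types at once.
abstract type AbstractChr end

struct ASCIIChr <: AbstractChr; v::UInt8;  end
struct UCS2Chr  <: AbstractChr; v::UInt16; end

Base.UInt32(c::ASCIIChr) = UInt32(c.v)
Base.UInt32(c::UCS2Chr)  = UInt32(c.v)

# Written once against the abstract type instead of per concrete type:
isbmp(c::AbstractChr) = UInt32(c) <= 0xffff
```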
This is precisely where I feel that the new architecture for strings in Julia is not good at all.
However, instead of debating it further, shortly you can simply test for yourselves which is easier to use, which is more “Julian”, which is more performant (not just for a few use cases), which uses fewer resources, and which is more robust, more extensible, and more interoperable with other languages.
The problem here is that there are many places in Base that assume that: 1) Char is used (currently, all AbstractString iterators return Char, which is very expensive to encode, only to then have to decode it for use); 2) many functions always return String, even when passed another AbstractString type (I’ve been very careful to fix that when replacing the functions in Base); 3) map even forces the given function to return a Char, and gives an error otherwise.
Unfortunately, I feel this shows a lack of understanding about those functions, and about how people (at least those who deal with text instead of numbers for a living) expect to be able to use them.
Again, maybe it would be better to wait until I finish my package and can better show its advantages over what has recently been pushed into Base.
Yes, I do believe there has been a serious misunderstanding, about character sets, code points, and encodings thereof, as well as mapping operations such as upper/lower/titlecase, collations, etc., but not on my part.
Are functions on numbers required to be different if they return different answers in Julia? No, not at all.
Thankfully, Julia is not C, and one doesn’t have to do dec32_add(my32bitdecimal_a, my32bitdecimal_b),
or DecFP.add64(a, b), or BigInt.add(a, b), etc.
The mapping tables provided by the Unicode standard are not even recommended by the standards bodies to be used in the way you are saying; the recommendation is that you should be using locale- (language-) specific tables, in addition to normalization.
As far as different Unicode versions go, I think the recommendation there is to use the most recent version available, not to try to keep tables for different Unicode versions (they have very strict rules about compatibility between versions; in one famous case, the official name of one character contains the misspelling “BRAKCET”, which cannot be changed because of compatibility).
No, not at all. You would do: lowercase(a) == lowercase(b), where a and b are of type ASCIIStr.
(also, you should never be using any of the str* functions anyway, because of the security issues with nul-terminated strings and buffer overruns).
You might also want to do lowercase(a, locale=locale"en-us") or uppercase("i", locale=locale"tr-TR"), which should return “İ” (i.e. dotted uppercase i), and uppercase("ß", locale=locale"de") must return “SS”.
Also, people want to be able to set their default locale, and then have all calls to uppercase/lowercase/titlecase, and other functions such as isalpha, isalnum, etc. work correctly for that locale.
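A hypothetical sketch of what such a locale-aware API could look like (neither Locale nor a locale keyword exists in Base; uppercase_loc and the Turkish special case here are purely illustrative):

```julia
struct Locale
    lang::String
end

const DEFAULT_LOCALE = Ref(Locale("en"))   # a user-settable default

function uppercase_loc(s::AbstractString; locale::Locale = DEFAULT_LOCALE[])
    if locale.lang == "tr"
        # Turkish: dotted lowercase i maps to dotted capital İ (U+0130)
        return map(c -> c == 'i' ? 'İ' : uppercase(c), s)
    end
    return uppercase(s)
end

uppercase_loc("i")                         # "I"
uppercase_loc("i", locale = Locale("tr"))  # "İ"
```

A real implementation would consult the full locale-specific tables (e.g. via ICU) rather than special-casing single characters like this.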
There are a whole host of issues that need to be addressed if you are serious about text.
I have felt since I first encountered Julia, that Julia could be the best language for handling these issues, which I had to address decades ago, first for single and multibyte character sets, and then for Unicode 1.0, then with non-BMP characters in Unicode 2.0, designing and implementing the support myself (because such convenient libraries such as ICU did not exist until after 1999).
Julia has all the right features to make dealing with different character sets and encodings easy for the user, as long as decisions such as this one of “Unicode.islower” are not forced on the language.
All along, my stated goal has been simply, to make Julia the premier language for dealing with text, for all use cases, not just the very limited ones that it seems Stefan has been targeting.
Again, please withhold your judgement until you see my package. I don’t expect to get everything right, and I’d like as much constructive criticism as possible when I publish it (hopefully within a few days).
I encourage hearing dissenting opinions on technical matters, I’ve found I always get better results when I get more input, from people with different use cases, differing needs, different ideas.
So none of the stewards are even going to say anything about the “ridiculous post”? OK, if that is the way it is going to be, you have lost me, stewards. From now on Julia is just a tool for me; sadly, I don’t want to be related to this community anymore. You have alienated me.
Stewards you only represent yourselves and your pals, I respect your work, but I cannot respect you …so long and thanks for all the fish!
PS: You have splendid community (double) standards, congratulations!
By the way, it’s not Kristofer’s comment that I’m upset about, but the double-standards attitude some of the stewards and mods have shown.
Perhaps we can get this thread back on a technical track. At some point I’d be happy if mods were able to split out the OT posts (technical and otherwise) into their own threads. As far as I can tell, the initial question could be paraphrased as:
Several string case-related functions (islower, lowercase, isupper, uppercase) were moved out of Base into Unicode, is this a good idea?
It seems like the answer hinges on what the meaning of these generic functions is. If the meaning is Unicode-specific, then Unicode seems like the right place. If the meaning is more generic, then Base or a more generic Strings module seems pretty reasonable.
I’m going to make the argument that Unicode is an appropriate place for these functions.
Currently the docs for these functions are:
islower
Tests whether a character is a lowercase letter. A character is classified as lowercase if it belongs to Unicode category Ll, Letter: Lowercase.
lowercase
Return s with all characters converted to lowercase.
isupper
Tests whether a character is an uppercase letter. A character is classified as uppercase if it belongs to Unicode category Lu, Letter: Uppercase, or Lt, Letter: Titlecase.
uppercase
Return s with all characters converted to uppercase.
Given that the criteria for upper and lower case are defined as following the Unicode spec, it doesn’t seem unreasonable to include it in the Unicode module. That doesn’t preclude other string types that could be uppercased using the same semantics, e.g. I could have MyCompressedStringType that has some arbitrary storage on disk/memory but that I expect to follow the Unicode semantics with respect to what counts as an uppercase or lowercase letter, for which I might define Unicode.isupper(s::MyCompressedStringType), etc.
It also opens the door to have separate generic functions like ASCII.uppercase that might operate on the same types (e.g. Vector{UInt8}) but have simpler/different semantics.
I think one of the points of confusion here is that Unicode defines both the semantics of what it means to be uppercase and also how one might represent those strings in memory. These functions belong in Unicode because they’re using the semantics defined by the Unicode standard, but that’s orthogonal to how the strings are represented in memory.
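To make that orthogonality concrete, here is a toy sketch where the semantics live in the module and the representation is just a raw byte (the module names are invented, and isuppercase is Base's newer name for the isupper predicate):

```julia
module ASCIISem
    # ASCII semantics: only A-Z count as uppercase, whatever the data is
    isupper(b::UInt8) = UInt8('A') <= b <= UInt8('Z')
end

module UnicodeSem
    # Unicode semantics on the same byte: decode (here as Latin-1) and
    # defer to Base's Unicode category tables
    isupper(b::UInt8) = isuppercase(Char(b))
end

ASCIISem.isupper(0xc9)     # false: 0xC9 is not in A-Z
UnicodeSem.isupper(0xc9)   # true: 0xC9 is 'É', Unicode category Lu
```

Same argument type, same data, different answers: they really are different functions, distinguished by namespace rather than by dispatch.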
I think that in order for these functions to make more sense in Base than in Unicode, one of the following criteria must be true:
either
Unicode defines the only uppercase/lowercase semantics we need to worry about
or
the uppercase/lowercase semantics we want to use can always be determined by the input type, in which case we could have Base.isupper dispatch to the method implementing the semantics we want.
I think that both sides seem to agree that (1) is false, and I’m skeptical of (2), as it would be nice to be able to define these functions on e.g. Vector{UInt8}.
You should be able to get lower overhead by avoiding the extra level of indirection by allocating a String(n) directly instead of a Vector(String(n)). Even a Vector(n) would be (slightly) lower overhead than using a StringVector(n) backing.
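For illustration, the zero-copy handoff being discussed looks roughly like this (Base.StringVector is an internal, undocumented API, so treat this as a sketch rather than a stable recipe):

```julia
# Allocate String-backed bytes, fill them, then hand ownership to a
# String without copying the data.
buf = Base.StringVector(3)                 # Vector{UInt8} over String memory
buf[1], buf[2], buf[3] = 0x61, 0x62, 0x63  # the UTF-8 bytes of "abc"
s = String(buf)                            # takes ownership; `buf` is emptied
```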
where a and b are of type ASCIIStr
a and b are Unicode strings (and will get encoded as UTF16). Yeah, it’s probably true that a Libc.stricmp is the more useful function though for this case, (mapping to a ccall LCMapString with the LOCALE_INVARIANT flag) – not ASCII – since Windows claims to map all unicode values to uppercase (probably using unicode 5.0/5.1 or 6.0 or 3.1 – although which one seems to be undefined). I’m just trying to throw out some quick examples of where the distinction can matter (specifically: interacting with a foreign library that, unfortunately, hard-codes some specific translation layer in its API), not a full discussion of the details of such an implementation.
These functions are not defined by Unicode, they have been in the standard C library for over 40 years, long before Unicode was even thought of.
The documentation you reference was written by the people implementing Julia, about their particular implementation (rather non-generic) of those functions, and using that documentation is kind of a circular argument.
The Unicode and WWW organizations both make very clear that the one-to-one mapping tables provided are just kind of a default, and that locale & language specific mappings should be used, generally after performing normalization.
Why would you want to define textual functions on a vector of numbers?
Would you expect to calculate median or mean on a string?
If you want to use string functions, you should operate on something like a String or (with my package) some parameterized Str type, such as a UTF8Str (which would use Unicode character set mapping tables by default, unless a locale or mapping is specified for the default, just like the default precision and rounding modes can be specified for things like BigFloat).
A RawByteStr could also be used (however a RawByteStr would not have any default mapping tables or encodings).
I disagree that either of those criteria you mentioned must be met.
The real criteria is simply whether or not they are generic functions.
Since they are, they should be defined generically in Base (and at least handle the basic one-to-one non-language specific mapping tables from the current Unicode standard, as String is defined as being based on Unicode and is part of Base), allowing people to extend those functions, either by defining methods that work on other AbstractString types, such as the ones from my Strs.jl, @nalimilan’s StringEncodings.jl, or LegacyStrings.jl, or by defining methods that take extra arguments for specifying mapping tables and/or locales.
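The extension pattern being described might look like this sketch (ASCIIStr here is a hypothetical stand-in for the package types mentioned, with only enough of the interface to show the method extension):

```julia
# A package-local ASCII string type with a fast, byte-wise method
# extending the generic Base.lowercase.
struct ASCIIStr <: AbstractString
    data::Vector{UInt8}
end

Base.lowercase(s::ASCIIStr) =
    ASCIIStr([UInt8('A') <= b <= UInt8('Z') ? b + 0x20 : b for b in s.data])
```

Because lowercase stays a single generic function in Base, callers never need to know which package provided the method they end up dispatching to.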
These Julia functions are being defined right now, with semantics that may or may not vary depending on the argument type (which the C stdlib doesn’t help us with). It’s important to reference existing standards, but it is not always obvious how to do so.
It seems like the crux of this particular issue (where islower etc. should live) hinges on the meaning of those generic functions, so we should define them. Currently the semantics of these particular generic functions (e.g. islower) are defined in terms of the Unicode standard, so they’re placed in Unicode, which seems consistent.
It seems that you’d like a broader definition of the generic function. Perhaps the first step would be for you to write an alternate help string for the generic islower function you prefer, so we have a concrete definition to work from.
They are defined in terms of the default mapping tables provided by the Unicode standard, while ignoring what the standard says about case mapping and folding.
Since those capabilities are a lot more complex (and require doing things like one-to-many (simple) and many-to-one or many-to-many (hard) transformations), my recommendation is as follows:
All those functions stay as generic functions defined in Base (honestly, nobody should have to do using Unicode just to do something like uppercase the letters in a hex string, or to compare “INF”, “inf”, and “Inf” efficiently). Just look at all the changes required in a decimal floating point package, which hardly uses strings at all! 0.7 updates by quinnj · Pull Request #50 · JuliaMath/DecFP.jl · GitHub
String (and any other AbstractString that uses the Unicode character set, something I am making explicit in my Str package), should use those default Unicode tables without having to do a using Unicode just to get the names for those generic functions, which are used heavily from within Base itself.
AbstractChar needs to be added to Base (and then I can remove it from my package), and all of the code in Base changed to use AbstractChar instead of Char wherever Char is used generically and the code doesn’t depend on Char’s internal representation.
More complex mappings, mappings that use a default locale, or mappings that are passed a locale or character set as a keyword, can then be added easily via packages (after Str, I’ll be writing another one, Encodings, and maybe another called Locales, if somebody else doesn’t start helping out and do it first!)