Problems with deprecations of islower, lowercase, isupper, uppercase

Sorry to ‘derail’ this topic, but I have feedback related to the subject line. I’m just a user of Julia and have been delighted with its brevity relative to Python, which most of the scientific community seems to be charging towards. I’m afraid I know nothing of different string encodings. I just ran some of my code in the latest master and was caught by this.

I understand there is a tension between including functionality in Base by default and splitting it off to packages that can be easily updated, but I feel in this case, from a humble user’s perspective, this may have gone a bit far. I have a few points which occur to me:

  1. Operations like uppercase etc. are so basic that importing another module seems like overkill. Many languages, like Python (even Bash after v4!), supply this out of the box. One of the advantages of Julia is that I don’t have to write import numpy as np at the top of every single file just to have arrays. I would hate to have to do using Arrays in Julia! (Or even using Arithmetic!)

  2. If this does have to be taken out of Base, then Unicode is, I think, a slightly unintuitive name for us mere scientists. To me it sounds like a module for dealing with strange characters, not for what my script just failed on: transforming a variable name into a file name. How about Strings?

  3. Throwing an error and telling the user to restart Julia when encountering a use of uppercase is a bit stiff. Would this not normally be a deprecation (and associated warning)? Having to do this after a long REPL session with lots of stuff in memory is pretty galling. (I know I can do Base.uppercase(x) = Unicode.uppercase(x) or using Unicode in a juliarc.jl file, but it’s the message which packs the punch.)
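In case it helps anyone else, here is the workaround spelled out (a sketch, assuming the current 0.7-dev Unicode stdlib):

    # ~/.juliarc.jl: restore the moved names at startup
    using Unicode                    # brings uppercase, lowercase, etc. back into scope
    # or forward just the ones you need:
    # Base.uppercase(s::AbstractString) = Unicode.uppercase(s)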

I’m guessing that the Base.String type is UTF-8(?), so surely you can’t update that without updating Base as well, in which case why make functions like uppercase unavailable on the basis that Unicode may change too quickly? (I don’t know the details so may be wrong.) Does string handling change so quickly that users can’t wait for a 0.0.x update?

I understand my input won’t really bear any weight, but I thought it might be useful at some point to hear an average user’s point of view on where to draw the line between ‘batteries included’ and ‘separately upgradable’.

9 Likes

Well that’s certainly a misunderstanding. As many people know, I dislike string interpolation syntax, period. But I acknowledge that it has conciseness benefits. My point is that if we’re going to lose some of that conciseness, it calls the whole feature into question.

1 Like

You say this as if it’s some perverse thing we implemented due to ignorance, but isn’t this just a property of UTF-8? UTF-8 is not a crazy default encoding choice. What default encoding should we use instead that would be universally better?

You’re distorting Stefan’s suggestion that people use character comparisons into a claim that all character predicates should be implemented that way. Obviously they need to be, and are, table-driven, and those table lookups could even be optimized for the new Char representation, though that hasn’t been done yet.

True, but how is that related to Char-to-Int conversions?

In what way are we doing that? I understand that UTF-8 is not optimal in all cases, but that’s quite different from e.g. causing data corruption. We are also being quite clear, e.g. in stipulating that Unicode.uppercase only does Unicode case mappings (for some version of Unicode); in fact perhaps we’re being too strict about that. We have also always been clear that Base.Char represents Unicode code points. Yes, that’s incomplete and doesn’t cover all use cases, but that’s not the same as making incorrect assumptions and getting wrong answers.

1 Like

On the special representation of String: I agree, and it is definitely in the roadmap to generalize that and allow other types to use the same representation. For now, wrapping a String instead of a Vector will take advantage of some of its efficiency. In some cases wrapping a tuple of integers works well too (they are fully inlined in structs).
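For example, a sketch of the wrapping idea (hypothetical names; not an official API):

    # Reuse String's compact storage instead of a Vector{UInt8} field:
    struct ByteBuf
        data::String              # one allocation, bytes stored inline after the header
    end
    nbytes(b::ByteBuf) = sizeof(b.data)
    byte(b::ByteBuf, i::Integer) = codeunit(b.data, i)   # raw access to the i-th byte

    # Small fixed-size data can be a tuple of integers, fully inlined in the struct:
    struct Short3
        data::NTuple{3,UInt8}
    end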

Too many people love string interpolation to remove it. The form I have in StringLiterals.jl unifies literals and interpolation without breaking compatibility with other C-like languages’ string literals: ONLY \ and " are special and need to be quoted, and what follows \ is pretty much the same between languages (people understand that there may be some differences in the escape sequences).

What about my idea? It costs only a single character if you have a lot of string interpolation of simple identifiers (frankly, I find that’s rare in my use now, because usually I want a bit of formatting), it is already used to mark interpolated strings in other languages, and it gives the user an indication that this is NOT your normal C-like string literal: $"This is my interpolated string, with $foo, $bar, $baz" ?
$"..." formatted strings could accept both legacy style interpolation (possibly with Stefan’s whitelist/blacklist idea, although I might simply say that you make it easy to explain, and say that the identifiers must be ASCII (or at least Latin1) (and so not need loaded tables, and avoid the dependence on a specific version of Unicode that I had brought up - the code would be easier to read internationally, only if you had identifiers with characters > 0xff would you need to use either $(...) or \(...)).

I know that @stevengj said he didn’t buy the “LaTeX” arguments, but from his own documentation of https://github.com/stevengj/LaTeXStrings.jl, I think that this would actually be very useful to people using his package.

You can also use the lower-level constructor latexstring(args...), which works much like string(args...) except that it produces a LaTeXString result and automatically puts $ at the beginning and end of the string if an unescaped $ is not already present. Note that with latexstring(...) you do have to escape $ and \: for example, latexstring("an equation: \$1 + \\alpha^2\$"). One reason you might want to use latexstring instead of L"..." is that only the former supports string interpolation (inserting the values of other variables into your string).

If this change were made, then only the \ would need to be escaped, which LaTeX users coming from any language with C-like string literals are already familiar with (unlike the need to escape $!)
So, his example would become simpler: latexstring("an equation: $1 + \\alpha^2$"), and string interpolation still works, using the Swift \(…) syntax.

I think that could be added very quickly to the FemtoLisp parser, and I already have the code for the rest (the only difference is that instead of f"..." and F"...", it would be "..." and $"...").

Currently, I’m using struct Str{T} <: AbstractString; data::Vector{UInt8}; end, and using Base.StringVector(n) to allocate the Vector, so it is the more efficient structure you introduced.

That does allow me to share the vector with a String, if the String is completely valid, for my types ASCIIStr (if only ASCII characters are present) and UTF8Str. I also wanted the structure to be the same for mutable strings, but I haven’t started implementing those yet (I only started last Wednesday, but have most everything that String does implemented, and am just ironing out some last little corners like this issue of the extra indirection).
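Spelled out, the layout just described is (a sketch; Base.StringVector is an undocumented internal, not a stable API):

    struct Str{T} <: AbstractString
        data::Vector{UInt8}
    end
    # allocate an n-byte buffer the same way String itself does internally
    newbuf(::Type{Str{T}}, n::Integer) where {T} = Str{T}(Base.StringVector(n))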

The problem is thinking that if you have a hammer, every problem in the world is a nail :slight_smile:

UTF-8 is exactly the right solution for things like Web pages, data transfer, and text in parts of the world with languages where characters outside the ASCII range are the exception rather than the rule.
However, it is not good at all when you are doing a lot of text processing, or for storage of text written in languages used by probably about three quarters of the world’s population. Doubling the amount of storage required, moving from something like SJIS or GB to UTF-8, is simply not acceptable to many customers, and rightly so! (That was a major competitive advantage that we at InterSystems had over other database vendors, who tried to push UTF-16 or UTF-8 on their Asian customers: I had a compaction scheme for Unicode that stored the text in even less space than SJIS, instead of incurring around a 33% penalty for UTF-16 (1.5 bytes per character average → 2 bytes) or generally a 100% penalty for UTF-8 (1.5 → 3 bytes).)

I don’t believe that I’m distorting it. People who deal with this sort of stuff already have these tables, based on the Unicode code points. Do you really think that, to claw back the performance lost, they’d have to redo all their tables using Stefan’s rather complicated encoding, which would then only be useful for Julia? Using standard Unicode code points, those tables can simply be stored in a compiled library, shared and used by all programs on a system.

The reason I brought that up, along with the recommendations of the W3C, is that the default upper/lower/titlecase mappings for Unicode are not fixed and cast in stone, and they are not even recommended for user applications, which really should be locale-specific. That means you need to deal with locale-specific tables, constantly going back and forth between Char and UInt32, which, until #24999, was a no-op.
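Concretely, the round-trip in question is just this (standard Julia; before #24999 both conversions compiled away entirely):

    c = 'Ω'
    u = UInt32(c)   # index the locale-specific tables by raw code point
    c2 = Char(u)    # and back; with the new Char representation this is no longer free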

Because of the principle of “Garbage In, Garbage Out”, and Stefan’s philosophy about handling text data (which might have been OK somewhere like a web retailer such as Etsy, but doesn’t fly at all when people’s lives or their livelihoods are at stake, as with the sorts of medical and financial applications written in the language I was responsible for).
The problem isn’t UTF-8 at all; it is allowing misidentified data in without detecting it at the first point it is encountered. Often, unless you have the entire file read in, you may not have enough information to auto-detect an encoding. If you store the data away, possibly altering sequences that were thought to be “invalid UTF-8”, you might not realize until the original data is no longer available that it really was something like CP-1252. Then, when that possibly critical data is needed, you find out it’s unusable garbage,
because you have no way of telling what the character set really was.

Stefan wants to avoid errors from what he thinks of as “invalid data”, but, in my experience (and as he should remember from a number of incidents over the last couple of years where people had problems with Julia), the trouble came from Julia code that simply assumed everything was UTF-8 and didn’t give developers good tools to deal with the issues. What is needed is to:

  1. make sure that all immutable strings are always valid (which also gives major performance advantages when processing);

  2. have a reasonable way of handling input in variant encodings (such as the variants of UTF-8 caused by overlong encodings, as in Java’s handling of \0, or where two UTF-16 surrogates are encoded as two 3-byte UTF-8 sequences instead of the correct single 4-byte sequence; if you don’t produce valid UTF-8, the string will not sort or hash correctly);

  3. allow, as much as possible, auto-detection of the most common character sets, make it easy to call external tools that can do an even better job of character-set detection, and detect things like the UTF-8 BOM, 16-bit and 32-bit BOMs and byte-swapped BOMs, as well as UTF-16 simply widened to 32 bits (this does happen!);

  4. have both “safe” conversions, which raise an error, and “unsafe” conversions, which may return a raw vector of bytes, 16-bit words, or 32-bit words if the input is not valid for the character set/encoding you had identified it as; also allow replacing invalid sequences with zero or more characters, or even passing a function to be called for each detected sequence of invalid code units, which can either raise an error or return zero or more code units (or code points) to insert in the output instead of the invalid sequence. (A sketch of this follows the list.)
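Here is a sketch of what point 4 could look like (hypothetical names and keywords, not the actual Strs.jl API):

    # safe by default: throw on invalid input unless a replacement strategy is given
    function to_utf8(bytes::AbstractVector{UInt8}; replace=nothing)
        s = String(copy(bytes))        # a String can hold invalid byte sequences
        isvalid(s) && return s
        replace === nothing &&
            throw(ArgumentError("invalid UTF-8 and no replacement strategy given"))
        io = IOBuffer()
        for c in s                     # iteration also yields the malformed characters
            print(io, isvalid(c) ? c : (replace isa Function ? replace(c) : replace))
        end
        return String(take!(io))
    end

    to_utf8(b"Mand\xe9 100")               # throws ArgumentError
    to_utf8(b"Mand\xe9 100", replace='?')  # "Mand? 100"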

I was surprised that people decided on such a non-generic (to me, non-Julian) approach here.
If you have a numeric type, and a transformation such as ‘+’, do you have to call ‘UInt8_add’, ‘Dec128_add’, or ‘BigInt_add’ to perform the operation? Does it matter if the result of a + b is not the same if a and b are UInt8 or Dec128 or BigInt? No, not at all! 0x80 + 0x80 = 0x00, but 128 + 128 = 256.
So, why should somebody have to load a specific “Unicode” package, and then call functions specific to that package, just to do a generic operation such as ‘uppercase’?
So, if I have a Str{:SJIS} string and I perform some operation such as uppercase on it, that should just work, without worrying about Unicode at all. Other standards, such as GB 18030, have their own ideas of default mapping tables, which should be respected.
Base.Char definitely should respect Unicode (and with #24999, it really doesn’t anymore, because you can now have any sort of invalid character in it). A Chr{:SJIS}, though, representing SJIS standard code points, should respect that standard, the same as a code point type for GB 18030 strings.
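In other words, ordinary multiple dispatch (a hypothetical illustration; each method would consult its own standard’s tables):

    abstract type MyStr <: AbstractString end
    struct AsciiStr <: MyStr; data::Vector{UInt8}; end
    struct SjisStr  <: MyStr; data::Vector{UInt8}; end

    # ASCII needs no tables at all:
    Base.uppercase(s::AsciiStr) =
        AsciiStr([UInt8('a') <= b <= UInt8('z') ? b - 0x20 : b for b in s.data])
    # an SJIS method would use the SJIS standard's own case mappings:
    # Base.uppercase(s::SjisStr) = ...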

These deprecations of islower, lowercase, etc. are also causing a lot of churn in code, such as DecFP.jl, that was simply trying to deal with ASCII characters.

It also doesn’t make sense to me: if Julia itself is using Unicode tables for identifiers and to tell whether something is upper, lower, alphabetic, numeric, etc., and Char is defined as representing a Unicode code point, and String as logically a collection of Chars, why isn’t that simply part of the base language?
I realize that people are concerned about tying the language to a particular version of the Unicode standard, but making people do using Unicode for things that are already present (but now hidden) in Julia is not, I believe, the best way of handling that issue.
Instead, I believe the tables could be autogenerated from the Unicode data plus Julia’s exceptions (for example, for the identifiers) and compiled into a shared library, instead of having so much (such as what counts as a valid identifier) hard-coded into utf8proc and the FemtoLisp parser, with other behavior (LaTeX and Emoji tab expansions) hard-coded into the REPL.
That is the simple technique I used for the Unicode, HTML, Emoji, and LaTeX entity tables that give StringLiterals the ability to handle things like “<dagger>” or “:smile:” as part of my extended escape sequences.

If you want to discuss this further, I’d be more than happy to - I know you are supposed to be on vacation now, so maybe afterwards (by then, I should have my Strs.jl package well tested with lots of benchmark data, pushed to GitHub).

4 Likes

Please do get your facts straight! Although the Swift compiler stores string literals encoded as UTF-8, internal processing is done with ASCII or UTF-16 (remember that Objective-C, like Java, uses UTF-16 internally).

Swift String values contain an instance of the _StringCore type, which is optimized to store either ASCII or UTF-16 encoded text.

Here is a rather interesting article about the performance problems caused in Swift by their complicated ‘Char’ type (which actually represents an extended grapheme cluster, not just a single code point):

As far as Go and Rust, yes, the default libraries support UTF-8, which is fine, but just like C or C++, you can use whatever package for strings that you want, and use UTF-16 or UTF-32, if you want to have decent performance for text processing.
(We are also using Go now at Dynactionize)

It is perfectly obvious to everybody that different encodings have different tradeoffs. The question is what should be the default. Maybe we can reduce the number of cases where the default is used, but I’m pretty sure we will still need to pick a default in some cases. You keep making irrelevant, tautological arguments like “doubling the amount of storage is bad”. Really? I thought wasting memory was good all this time! So supporting other encodings is great, but I will continue to believe UTF-8 is a reasonable default.

Finally. Thank you.

Yes, fair enough, but this is really just missing functionality. Leaving aside that auto-detecting encodings is a very hard problem (I know there are decent libraries for it, but they still have to guess to some extent), it’s just a lot of work to have a full multi-encoding environment. If your package provides it, that’s fantastic. Just don’t confuse our lack of some sophisticated functionality for some kind of intransigence, ignorance, or malice. We have limited resources as well as a desire to keep the base system simple.

Previously, we would translate invalid data to replacement characters. Now, we are at least able to copy it faithfully. IIUC, you want an exception thrown as soon as possible. Ok, this can continue to evolve, but in the meantime I don’t see how the new behavior is any worse. In particular, the new Char representation does the opposite of “assume everything is UTF-8”, as it goes out of its way to preserve data that isn’t UTF-8.

But our approach is that if you naively e.g., read a line from a file, the data is preserved. It might be nifty (if expensive) for us to call a library to auto-detect encoding, but that can still get it wrong. You absolutely have the option of using packages that provide other encodings to try to better handle the resulting String. I don’t think anybody would object to APIs for I/O in other encodings either.

I don’t get this. utf8proc’s tables already get updated for new unicode versions. I see you have e.g. an Emoji_Entities package providing mappings, but we also have a table of mappings plus a script to generate it from the W3C mapping file. What’s the difference? Is it just that the entity list is a separate package? If so, then yes I agree it would be fine to maintain these lists outside Base.
And emojis are one thing, but do you really want the rules for valid identifiers to be “pluggable”? That just seems like a terrible idea to me.

I appreciate that, but I’m back now :slight_smile:

How does julia not let you use other string packages or other string encodings? You say “the default libraries support UTF-8, which is fine” — but it wasn’t fine a few posts ago, when you were railing on about how this choice is worse for e.g. Asian scripts.

ASCII is obviously fast, but how is UTF-16 so great? It approaches 2x space overhead for mostly-ASCII data, and still doesn’t have O(1) indexing. Are you saying we should use UTF-16 internally? Again, yes it’s good for some data, but it’s hardly a universal improvement.

It doesn’t seem so bad to me to have uppercase and lowercase in Base. Is that the full set, or should all the character predicates be in Base as well?

6 Likes

UTF-8 is a reasonable default for I/O. It makes much less sense if you are doing more than minimal processing of the data, or if you are storing text in languages from outside Western Europe.

Have I ever said that it was because of intransigence or malice? I do believe there is a lack of knowledge about all the ins and outs of dealing with character sets, transformations (upper/lower/titlecase, for I/O, etc.), collations, and locales, and about achieving top performance with string operations. That’s not meant as an insult or criticism; different people have different knowledge, and most of the people contributing to Julia have quite a bit of knowledge of other things, like linear algebra and esoteric math.

[I deeply respect the great ideas that came together from the 4 creators to make the language I love the most, as well as all the amazing contributions from the project’s many other contributors. I just want to see string handling get the same love and attention to detail that the type system, math, linear algebra, use of LLVM, etc. have received.]

Well, there never really was consistent support for handling invalid data in Julia, and it hasn’t gotten much better; there are places where replacement with '\uFFFD' is hard-coded into the function, for example.

For other types, such as arrays and numbers, Julia typically is careful to raise an exception as soon as possible; why not for strings? Also, for dealing with pointers, there are specific unsafe_ versions of functions, so somebody using those functions is made aware that they are doing something potentially dangerous.
UInt(-1) raises an exception right away; it doesn’t just convert to 0xffffffffffffffff (but Julia is nice in that you can use either (-1) % UInt or reinterpret(UInt, -1) when that is what you want). Julia raises bounds errors (but also has a way of eliding those checks for performance).
The same principles should apply to strings as well.
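The numeric precedent, spelled out (standard Julia):

    UInt(-1)               # throws InexactError: conversions are checked by default
    (-1) % UInt            # 0xffffffffffffffff: wraparound, but explicitly requested
    reinterpret(UInt, -1)  # same bits, spelled out even more explicitly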

So, string conversions should: 1) by default, be safe and throw errors; 2) allow keywords (now nicely fast) to indicate a replacement strategy, either a default character such as SUB ('\u1a') or '\uFFFD', a string, or a function called with the invalid sequence to generate the replacement; and 3) have a way of preserving the input data if it is invalid. That is what I have implemented.
So, for the person naively reading a line from a file, they will always get back something of type Str.
Depending on whether they call a constructor such as Str, convert, or unsafe_convert, and what keywords they pass, such as convert(UTF8Str, str, replace='?') or unsafe_convert(UTF8Str, str), you get back ASCIIStr("Mand? 100 ? a Espa?a") or RawByteStr("Mand\xe9 100 \x80 a Espa\xf1a") when presented with the CP-1252 string b"Mand\xe9 100 \x80 a Espa\xf1a".
Therefore, the pass-through can happen very quickly, without any need to allow invalid characters in Char or in UTF8Str (which is really just an alias for the type Str(enc"UTF8"), as RawByteStr is an alias for Str(enc"Byte")).
If you read in a string and get a RawByteStr back, because the data was not valid UTF-8 (or whatever you thought it was), then you can just write it back out and always get exactly the same bytes back.

Note: distinguishing between ASCII, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE, and telling whether something is really one of the ANSI 8859-x sets (usually 8859-1) or CP-1252…CP-125x, is not really that difficult. The harder part is that, if something looks like one of the ANSI 8859 character sets or one of the Microsoft versions of those same code sets (such as CP-1252), you might need a hint: to say, for example, that you expect either 8859-1 (if no code units between 0x80 and 0x9f are present) or CP-1252 (if they are), or that it is likely to be 8859-15 instead of 8859-1.
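The BOM part of that detection really is only a few lines (a minimal sketch; real detectors add content heuristics for the no-BOM case):

    function detect_bom(b::AbstractVector{UInt8})
        n = length(b)
        # test 32-bit BOMs before 16-bit ones: 0xff 0xfe prefixes the UTF-32LE BOM
        n >= 4 && b[1:4] == [0x00, 0x00, 0xfe, 0xff] && return "UTF-32BE"
        n >= 4 && b[1:4] == [0xff, 0xfe, 0x00, 0x00] && return "UTF-32LE"
        n >= 3 && b[1:3] == [0xef, 0xbb, 0xbf]       && return "UTF-8"
        n >= 2 && b[1:2] == [0xfe, 0xff]             && return "UTF-16BE"
        n >= 2 && b[1:2] == [0xff, 0xfe]             && return "UTF-16LE"
        return nothing  # no BOM: fall back to 8859-x / CP-125x style heuristics
    end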

It’s that they are not compiled into the Julia image. Compiling them in slows down the build process and takes more memory than if the LaTeX and Emoji tables, which are only needed by the REPL, were loaded from files, where they can be updated at any time.

Actually, we had to provide exactly that, locale-specific identifier tables, for our customers when I architected the Unicode support at InterSystems, because customers wanted identifier tables that matched their national character sets, such as S-JIS or GB xxxx.
However, for Julia, I think it would be better to have those tables compiled into a shared library. (Essentially, you are already doing that for most, but not all, of the identifier information: some of it depends on whichever utf8proc library is being used and the version of the Unicode standard it supports, and that is then combined with data hard-coded into the FemtoLisp parser, which I think is a really terrible idea!)
Instead, all of that could be removed from the FemtoLisp code (even including such things as the lists of valid operators and the precedence lists), and a shared library created when Julia is built, based on whichever Unicode standard you want plus the Julia-specific identifier, operator, and precedence lists.
That would be shipped as something like: julia-v6-2-unicode10.so.
That makes for faster loading, less memory use, and the ability to upgrade Unicode versions separately from the Julia version (without opening things up to everybody having their own specific sets of valid identifiers, which I believe is your concern).

It is a very poor choice for processing text in the languages used by some three quarters of the world’s population.
I have never said that UTF-8 should not also be well supported (it needs to be better supported than it is now, really), nor that it isn’t a good default for certain things such as encoding for I/O.
What I’ve said is that it is not good as the one and only string type in the base language.
Actually, with my approach you’d have the parameterized type Str, which can handle encodings, valid strings, “raw” strings, and binary strings, as well as an optimized Union type, UniStr, that stores each string in its optimal form, like Python does, as ASCII, Latin-1, UCS-2, or UTF-32.
From what I’ve seen, most Julia programmers are scientists who really don’t want to worry about all the intricacies of UTF-8 encoding, and there are constant bugs from people using code unit instead of code point indexing.
My UniStr type makes life much easier for Julia programmers by being indexed only by code point.
If you really want to deal with all the complexities of processing UTF-8 data directly, you still can in my scheme (and a good deal faster than with String, because you know the strings contain only valid data), and it also gives major memory savings on text from all those Chinese, Japanese, Hindi, Arabic, etc. speakers.
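The class of bug in question, with plain String (standard Julia; with UniStr, s[2] would simply mean the second character):

    s = "αβγ"    # three characters, six UTF-8 code units
    length(s)    # 3
    sizeof(s)    # 6
    s[1]         # 'α'
    s[2]         # error: byte index 2 is in the middle of 'α'
    s[3]         # 'β': the valid indices are 1, 3, 5 (see nextind)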

Right, that’s why you don’t use UTF-16 either; you use UniStr (which could simply be String, if you agree with my ideas here).
For the quarter of the world where ASCII/Latin-1 handles almost all text needs, UniStr stores just one byte per character. (I realize emoji are a special case if you are dealing with things like Twitter feeds instead of hospital records, but they take 4 bytes in all Unicode encodings, UTF-8, UTF-16, and UTF-32, so it makes no difference.)
With my package you also get much better and faster support for UTF-16, for better interoperability with Windows, Java (and all the languages built on the JVM), languages like Objective-C and Swift, and libraries like ICU. Converting from ASCII/Latin-1 to UTF-16 is simply a widening operation, which can be done very quickly with SIMD instructions, and narrowing operations can be similarly optimized (all of which I’ve done in the past, for various architectures: Intel/AMD, POWER, and others). Converting from UCS-2 to UTF-16 is a no-op, and other conversions such as UTF-32 ↔ UTF-16 can also be done very quickly via SIMD instructions (I need to learn LLVM IR to do this without implementing the conversion code directly in assembly, as I did in the past).
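For instance, the Latin-1 → UTF-16 widening really is just zero-extension, which plain Julia already vectorizes reasonably well (a sketch; ASCII data stands in for Latin-1 here):

    bytes = codeunits("Espana")   # UInt8 code units, one per character
    utf16 = UInt16.(bytes)        # zero-extend each unit; UCS-2 → UTF-16 is then a no-op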

Doing a single dispatch on a Union type (Str{enc"ASCII"}, Str{enc"LatinU"}, Str{enc"UCS2"}, Str{enc"UTF32"}) only at the top-level call, and then not having to branch on every single byte to determine whether 0 to 3 continuation bytes follow, is a huge performance win. If people know they are dealing only with ASCII characters, or Latin-1 characters, or BMP characters, etc., they can eliminate even that very small overhead.

So, yes, it pretty much is a universal improvement over UTF-8. The only place where it would take more space is a long string with emoji etc. mixed in, but processing those strings would still be much faster than with UTF-8, and the emoji (or other non-BMP characters) themselves take exactly the same space in both UTF-8 and UTF-32.
I haven’t heard people complaining about the speed or storage needs of text in Python 3, which does exactly what I am proposing (but doesn’t have Julia’s nice ability to directly generate optimized code for each of the 4 possible “internal” Unicode types used in the union UniStr).
This is why I feel that Julia, with a string architecture like the one I am implementing (but in Base!), would end up being the premier language for doing text processing, combining high performance on strings with fast numeric analysis for NLP (which is precisely what I’ve been doing for the last 2.75 years at Dynactionize).

For now, unless a more Julian API is added (I had made a PR for one over 2 years ago, but it was still being bikeshedded when I was banned, and then my couple of outstanding PRs got summarily closed :frowning: ), yes, the full set really should be in Julia. Since the Char and String types are in Base and are stated to be Unicode, it doesn’t make sense to make people take the extra step of using Unicode (and restarting Julia and losing all their state!) just to use very generic functions (which existed long before Unicode!).
Maybe in v2.0, if a more Julian API is accepted for the is* predicates, those could be deprecated, but do they really hurt to leave in?


So, in a nutshell, I think that by default, the standard string type in Julia should be UniStr (renamed String, of course), with UTF8Str and UTF16Str fully supported, and the default encoding for I/O should be UTF-8.

6 Likes

Ideas are likely to be avoided if they promise to bring with them a series of migraine-inducing conversations – even good ideas.

Merry Christmas

2 Likes

Better to have a few migraine-inducing conversations now, rather than endure them for years to come as new people discover the shortcomings and problems in the current design, IMO.
Also, nobody is forcing you to participate in the conversation, but if you are interested in doing lots of string processing or NLP work in Julia (as we’ve been doing for almost 3 years where I work), I believe the possible pain is rather justified.

Pain is having to write a whole string package the week before Christmas, because a change was put into Julia that killed performance right as we are trying to get contracts signed, having planned to move to v1.0 as soon as it was stable (customers are a bit nervous about going live on v0.x software!).

4 Likes

To possibly catch people’s interest, here are some results (and I haven’t even started to optimize yet, and there is the issue of the extra overhead that String doesn’t have because it is handled specially in the C code):

Telugu.txt: 7519 lines, 505822 characters
[76016, 0, 429806, 0] # 76016 ASCII, 0 Latin-1, 429806 16-bit (UCS-2), and 0 UTF-32 characters
[0, 40, 0, 7479, 0]   # 0 empty lines, 40 lines of pure ASCII, 0 lines with at most Latin-1,
                      # 7479 lines with at least one UCS-2 character, 0 with a non-BMP character

  # columns: result, total (µs), per line (ns), per character (ns), per byte (ns)
  String:      Bytes: 1365434    Chars: 505822        2.699 bytes/char
    sizeof:                     1365434     15.842µs      2.107ns      0.031ns      0.012ns
    length:                      505822   2146.636µs    285.495ns      4.244ns      1.572ns
    Chars iteration:             505822   3321.970µs    441.810ns      6.567ns      2.433ns
    isascii on string:            76016   3517.758µs    467.849ns      6.955ns      2.576ns
    isvalid on string:           505822   4688.878µs    623.604ns      9.270ns      3.434ns
  UCS2Str:     Bytes: 1011644    Chars: 505822        2.000 bytes/char
    sizeof:                     1011644     16.716µs      2.223ns      0.033ns      0.017ns
    length:                      505822     17.972µs      2.390ns      0.036ns      0.018ns
    Chars iteration:             505822    566.310µs     75.317ns      1.120ns      0.560ns
    isascii on string:            76016    666.204µs     88.603ns      1.317ns      0.659ns
    isvalid on string:           505822    641.735µs     85.348ns      1.269ns      0.634ns
  UTF32Str:    Bytes: 2023288    Chars: 505822        4.000 bytes/char
    sizeof:                     2023288     22.010µs      2.927ns      0.044ns      0.011ns
    length:                      505822     18.020µs      2.397ns      0.036ns      0.009ns
    Chars iteration:             505822    842.835µs    112.094ns      1.666ns      0.417ns
    isascii on string:            76016    904.346µs    120.275ns      1.788ns      0.447ns
    isvalid on string:           505822    851.557µs    113.254ns      1.684ns      0.421ns
  UniStr:      Bytes: 1010938    Chars: 505822        1.999 bytes/char
    sizeof:                     1010938     31.619µs      4.205ns      0.063ns      0.031ns
    length:                      505822    169.054µs     22.484ns      0.334ns      0.167ns
    Chars iteration:             505822  29766.863µs   3958.886ns     58.848ns     29.445ns
    isascii on string:            76016  63246.861µs   8411.605ns    125.038ns     62.563ns
    isvalid on string:           505822  46753.958µs   6218.109ns     92.432ns     46.248ns

The last 3 values for the Union type I’m still investigating; I’d hoped the new code for speeding up Union dispatch would have made those fast. I may have to do manual dispatch to one of the 4 types for now
(and will definitely have to do that for this to perform well on v0.6.2).

I think that @viralbshah and the other people in JuliaComputing’s Bangalore office might be interested in the better performance and lower storage requirements.

Note: I’ve got benchmark results for about 15 languages already, I’ll put up a gist after I add a bunch more tests of is* predicates, upper/lower/title case mappings, collect, searching, etc.

1 Like

These are impressive at first sight, but once you realize that UCS-2 doesn’t support all Unicode codepoints and that UTF-32 takes 4 bytes per codepoint, it’s less appealing, at least as a default type for a language. Finally, your UniStr type is interesting in terms of memory use, but in terms of performance it’s much worse than String (at least currently, Union optimizations haven’t all landed yet).

1 Like

We really need more tests to form a qualified opinion, but IMHO the sizeof comparison, 2023288/1365434 (≈148% memory), is not such a bad trade-off for a 400% speed gain (in char iteration).

Edit: I am not arguing here for using UTF-32 as the internal String representation! (Although that could be a question too.) But there are surely problems where this trade-off is worth paying.

There aren’t any more.

That’s exactly what the String type does.

Yes, that’s exactly what String does. It’s not called UTF8String anymore, and this is one of the reasons. One can load a package such as yours providing UTF8Str and get exactly the behavior you describe with String instead of Str or RawByteStr.

It is the tiniest imaginable fraction of the build time. But I agree that it’s good to move things out of Base and make it more modular, which we have been doing. I’m pretty sure the REPL stuff will go out at some point.

Ok, I think this is the key point. Julia at this time explicitly only supports source files in UTF-8-encoded Unicode. Supporting source in other encodings might be nice, but it has not been a priority yet.

You can’t be serious. The code and data associated with the parser is tiny.

The request to be able to change unicode versions independent of the julia version is valid, but frankly much fussier than most people seem to need. In the fullness of time, sure it would be a nice thing to do, but priorities priorities.

So all this comes down to is that you want more string types in Base? Ok, that’s a valid opinion, but you just defended Go, Rust, and C/C++ saying “yes, the default libraries support UTF-8, which is fine, but just like C or C++, you can use whatever package for strings that you want”. Why is it ok for them to put other encodings in packages, but not us?

Look, I think your UniStr type is cool, and a totally reasonable thing to use, but there are tradeoffs. We would have to look at a large range of benchmarks and scenarios to decide that it’s generally better.

They can speak for themselves, and we talk to them every day. There are Julia users in many countries, but we just don’t see many complaints about UTF-8.

5 Likes

This is a bit dramatic. Surely you’re not going to ship software on a 0.7.0-dev version of Julia? You wouldn’t have had to do this the week before Christmas if you had given us time to fix performance before the release.

3 Likes

I never said that either UCS-2 or UTF-32 should be used as a default type for a language; please read what I’ve written all along. The point was to show more clearly what the performance of UniStr will be once the issues of String having “special” access and the Union optimizations not having landed are dealt with.

Remember also - these are results from only a week of working on the issue part time, while trying to enjoy our Xmas break that started last Friday. (I would also say that the amount I’ve been able to accomplish in such a short time is a huge testament to how great Julia is! :nerd_face:)

1 Like

Just as there are aspects of the design of Julia that make it possible to achieve great performance, the problems with the current approach to String and the Char changes mean that it will never be possible to get the same sort of performance as I will be able to achieve.

Sure, if you think it’s a good idea to spend months trying to work around all sorts of performance potholes caused by a faulty design, instead of using that valuable time to do something like shared memory buffers for both arrays and strings (which is what I had hoped Stefan was working on over the last year or so, after he mentioned it during one of his talks and I asked him about it at ODSC back in 2016), go ahead; but quite frankly, I think it’s a huge waste of Stefan’s talent.

Customers are already evaluating on v0.6.1 (we only moved to v0.6 from v0.5.2 last month).
I had started testing the changes needed to make things v0.7/v1.0 compatible when the #24999 bomb was dropped, drastically affecting performance and breaking not only our code but also packages that we depend on.

Whether or not something is done about this in Julia itself (which would help all Julians), I need to make sure that we at Dynactionize can move forward; at this point, that means avoiding String as completely as possible (which we already did), and now even Char (which will be much more difficult to handle, since there is currently no AbstractChar in Base that people can use to keep their character-handling code generic, the way they can with AbstractString).

I think you (and @nalimilan) are missing the point here. As I said earlier, the UTF-32 representation is almost never used; the place you’d most likely see it is when processing text from Twitter (or one of the Asian chat services) with lots of emoji, and even then only for the particular strings that contain one of those characters (and those, tweets etc., tend to be rather short).

Otherwise, things are generally a good deal shorter using my UniStr than UTF-8.