Problems with deprecations of islower, lowercase, isupper, uppercase

Some more results:

Russian.txt: 5427 lines, 481891 characters
[100030, 469, 381392, 0]
[0, 3, 0, 5424, 0]
  String:      Bytes: 865360     Chars: 481891        1.796 bytes/char
    sizeof:                      865360      9.082µs      1.674ns      0.019ns      0.010ns
    length:                      481891   1639.185µs    302.043ns      3.402ns      1.894ns

vs.
  UniStr:      Bytes: 605856     Chars: 481891        1.257 bytes/char
    sizeof:                      605856     36.371µs      6.702ns      0.075ns      0.060ns
    length:                      481891    147.318µs     27.145ns      0.306ns      0.243ns

That is even while suffering the effects of an extra indirection through a pointer, and without having all of the new Union optimizations.

Japanese.txt: 2037 lines, 292885 characters
[1590, 0, 291295, 0]
[0, 12, 0, 2025, 0]
  String:      Bytes: 875475     Chars: 292885        2.989 bytes/char
    sizeof:                      875475      2.511µs      1.233ns      0.009ns      0.003ns
    length:                      292885    900.173µs    441.911ns      3.073ns      1.028ns

vs.
  UniStr:      Bytes: 585302     Chars: 292885        1.998 bytes/char
    sizeof:                      585302      9.656µs      4.740ns      0.033ns      0.016ns
    length:                      292885     44.510µs     21.851ns      0.152ns      0.076ns

Right now it’s already 20x faster at something as frequently used as length, and judging by other results, once the extra indirection and union dispatching are dealt with, it should be at least 50x faster than String.

When I started to evaluate Julia (a fast compiled language), I ported my small script (nothing important, just a toy script) from Python (a slow interpreted language), and it was 2 times slower. I really did try to play with performance optimization.

I think there were (are?) two problems: 1. Python has faster generators, and 2. Python has/had faster strings.

For my script, using UTF-32 (if I only needed 1.5x more memory for 4x faster code) could be useful!
I have no ambition to choose the best string implementation! I was only trying to say that, in that case, if I had two possibilities, 1. UTF-32 and 2. the 0.5 or 0.6 Julia strings, I would use UTF-32 and would be glad.

I think that Julia needs to have a battery for more efficient strings if we want to convince more Python people. And we don’t need one hammer for all nail types :slight_smile:

Nobody disagrees with that. We absolutely support using packages to provide UTF-32 strings, and indeed they exist. I don’t understand what the confusion is here. We’re not saying everybody should use only one string type ever, it’s just that we need to pick a default to provide, just like every other modern high-level language.

Was this a script where everything was done at the top level? That could explain it. We have some benchmarks where Python is faster with strings, so I understand that, but surely Python generators aren’t faster in every case. I would really appreciate it if you could post a benchmark where Python generators are faster.

Not true: please see the definition of textwidth, in base/strings/unicode.jl, line 272 on master.

function textwidth(c::Char)
    ismalformed(c) && (c = '\ufffd')
    Int(ccall(:utf8proc_charwidth, Cint, (UInt32,), c))
end

But the String type silently allows in garbage, and you can’t tell later that it’s there. That is very different from what I’m doing.

When I made my Julia-lite branch two years ago, I stripped out even the REPL, and it was definitely not the “tiniest imaginable” amount of time (especially on my 32-bit Raspberry Pi) when I measured it.

I don’t think source in other encodings is necessary - I was just pointing out why we had to make the identifier tables table-driven back 29 years ago, when I first started dealing with NLS (national language support) issues.
The benefit to Julia (it isn’t really much work at all, and I’d do it myself if it weren’t such a pain for me to get PRs into Julia now) is making Julia independent of Unicode versions, and especially being able to update executables written in Julia by swapping in a library with the new Unicode version info compiled for that particular Julia version.

It’s not the parser tables, it’s the tables loaded as Dicts for the REPL. I just said that the parser tables should be done in the same way, not that they were the reason for doing it.

Not really - what I’d like to see is a single “core” type (with 4 subtypes) that users would see pretty much as a single immutable string type: one with O(1) indexing, the ability to pass through (and inspect!) invalid data via Raw string types without losing much faster code paths that deal only with valid Unicode data, and faster UTF8Str, UTF16Str, and UTF32Str types for interfacing with things like databases, languages, or file formats that do want UTF-8, UTF-16 for Windows, or UTF-32 for Unix wchar.
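As a very rough sketch of the idea (the type and field names here are illustrative assumptions, not the actual UniStr implementation):

struct ASCIIStr; data::Vector{UInt8}; end    # all code points < 0x80
struct LatinStr; data::Vector{UInt8}; end    # all code points < 0x100
struct UCS2Str;  data::Vector{UInt16}; end   # all code points < 0x10000
struct UTF32Str; data::Vector{UInt32}; end   # anything above that

const UniStr = Union{ASCIIStr, LatinStr, UCS2Str, UTF32Str}

# Every member is a fixed-width encoding, so code point indexing is O(1):
codepoint_at(s::UniStr, i::Integer) = Char(s.data[i])

The point is just that each concrete member is fixed-width; how efficiently the union (or a wrapper around it) can be dispatched on is exactly the performance question mentioned above.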

All of the encoding support (which I’m also working on and will have finished soon) would definitely be a package. UniStr (hopefully renamed to String) would be what people mostly use directly; it’s just the union type, or a wrapper for the union type if that can be done efficiently. Something like Str{enc"CP1252"} would have to be loaded from the stdlib or an external package.

People would be happy not to have to deal with issues like “José bebe café” not being sliceable as str[6:9] to get bebe, and getting errors from indexing into the middle of characters (see the example below).
Also, generally taking less memory in most of the world would be nice when trying to scale large NLP projects.
I honestly think many Julians would rejoice.
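For example, with the current byte-indexed String (a minimal illustration of the slicing issue mentioned above):

str = "José bebe café"
str[6:9]    # " beb", not "bebe": 6 and 9 are byte offsets, and 'é' takes two bytes
str[5]      # throws an error (StringIndexError on recent Julia): byte 5 is inside 'é'

With a fixed-width representation, slicing by character positions 6:9 would give "bebe" directly.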

I am well aware of the tradeoffs - I had to keep large customers all over the world happy over decades with support for different character sets and Unicode.
What concerns me is: where is the large range of benchmarks and scenarios showing that #24999 was somehow better than before, when Char was the same as the Unicode code point? Do you have any Gists of the results of the benchmarking that was done? Benchmarking done with sample sets that aren’t 100% ASCII? (All of the minimal results that Stefan has pointed to so far have been done on just a dictionary file, which is entirely composed of ASCII characters, not the sort of thing you’d be processing for NLP work.)

Well, I also speak with our team in India (in Kochi, Kerala) on a daily basis, and we are doing work with NLP, text analysis of unstructured and semi-structured data, and databases.

Currently, Julia is still in a bit of a niche, with scientific/technical/numeric users.
Just as I almost turned away when I first evaluated Julia, seeing how poorly it did with strings (performance-wise compared to other languages such as Python 3 or Caché ObjectScript, bugs in handling Unicode, and a lack of functionality for dealing with things like string encodings), I imagine most people who need a good tool for string processing don’t stick with Julia, so you just don’t hear from them.

When somebody like me tries to say that, while UTF-8 is great for certain things, it is maybe not the best thing for a technical language where you might want to be doing text analysis, they end up getting shouted down, being told that they don’t know what they are talking about, that “UTF-8 won”, and to shut up.

3 Likes

I understand how, if you’re working with UTF-32 or raw code points, the new Char type can be slower due to extra encoding/decoding work. I’m less clear on what the performance problems with String are. As I understand it:

  1. length and codepoint indexing are O(n) (see the sketch after this list)
  2. It’s larger than Latin-1 or UCS-2 for some data
  3. Some functions are slower due to not being able to assume valid UTF-8
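For point 1, a small illustration:

s = "αβγδε"               # five characters, ten bytes in UTF-8
i = nextind(s, 0, 4)      # byte index of the 4th character, found by scanning from the start
s[i]                      # 'δ'
length(s)                 # 5, also computed by scanning the code units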

Am I missing anything?

For UTF-8 validation, the problem we hit was the time taken to do validation e.g. when reading each line of a file. I know things can be faster once you’ve done validation, but we just couldn’t get competitive performance for readline when doing it.

For point (1), while obviously you need length and nth-codepoint sometimes, our belief is that this is uncommon. It’s weird for code to contain str[6:9]; usually the 6 is derived by searching for something. I also feel that indexing and counting e.g. graphemes is fairly common, in which case O(1) codepoint access doesn’t help much.
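For example (the specific string here is only for illustration):

s = "José bebe café"
r = findfirst("bebe", s)   # a range of byte indices, 7:10 in this case
s[r]                       # "bebe": indices produced by searching are always valid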

3 Likes

The first confusion is very probably my fault. Sorry! The reason I didn’t use UTF-32 at that time was that I didn’t know the possibility existed. (Does the documentation say anything about it?) It was my first try at writing something in Julia!

In my previous message I just wanted to emphasize my preference to pay 1.5x the memory usage for a 4x speed gain, and that I had a real test case where my evaluation of Julia failed because of slow generators (and/or slow strings).

On the generator problem (described here), @StefanKarpinski wrote: “Not much optimization work has been done for this kind of usage of generators because if you really care about high performance you wouldn’t use generators like this in the first place. It would be nice to eventually make this faster, but it’s not a super high priority.”

We don’t need to focus on generator performance in this phase of Julia’s development cycle.
Although I would really like to see improvements in 1.x, I would rather help than distract you from the important work at hand!

1 Like

This is really misleading. We don’t substitute \ufffd in a way the user would notice; it’s just for purposes of calling charwidth internally. It doesn’t replace or discard data in any way, since this function returns an integer.

Of course you can tell it’s there, by looking at the data or, now, the Chars that come out. What am I missing?

The whole REPL code, yes. I was just talking about the emoji/latex tables.

1 Like

Generally, when you read something in, you want to do something with it.
The benefit of paying that penalty once, instead of every time a character is accessed, usually far outweighs the cost of doing the validation.

Also, just how was the code written that you say made readline too slow, and for what version of Julia was it?

The conversion code that I wrote for UTF-8 etc. (which got moved to LegacyStrings) did full validation (and could even handle the odd UTF-8 variants optionally) and was still a lot faster than the old UTF-8 code I replaced, which then allowed optimized operations in most cases.
Also, I hadn’t even begun to optimize that validation code. Since, depending on where you are, UTF-8 frequently contains lots of ASCII, my old UTF-8 handling code would handle that case with SIMD instructions, and even without SIMD instructions would process 8 bytes at a time.
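As a minimal sketch of that kind of ASCII fast path (illustrative only, not the actual code referred to above):

# Count the leading ASCII bytes, 8 at a time; a full UTF-8 validator would
# continue byte by byte from wherever this fast path stops.
function ascii_prefix_length(v::Vector{UInt8})
    n = length(v)
    i = 1
    GC.@preserve v begin
        p = pointer(v)
        while i + 7 <= n
            chunk = unsafe_load(Ptr{UInt64}(p + i - 1))
            (chunk & 0x8080808080808080) != 0 && break   # some byte has its high bit set
            i += 8
        end
    end
    @inbounds while i <= n && v[i] < 0x80
        i += 1
    end
    return i - 1
end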

We’re using u8_isvalid in src/support/utf8.c. It would be great if that could be made hugely faster, but I’m pretty sure even just iterating over the data will make you lose a readline benchmark. Yes, the equation changes if the benchmark includes other string operations. But there are lots of common operations whose performance is not really affected by validity checking. For example, splitting lines on ASCII delimiters.

While that was a concern initially, this didn’t turn out to be true in the end. I did have to write some very tricky low-level implementations of functions, like this:
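Purely as an illustrative sketch of the flavor of such functions (this is not the actual Base code being referred to), advancing past one character of possibly-invalid UTF-8 without trusting the lead byte might look like:

# Return the index just past the character starting at byte i, stopping early
# on truncated sequences and treating stray bytes as single malformed characters.
function next_char_index(b::AbstractVector{UInt8}, i::Int)
    n = length(b)
    lead = b[i]
    ncont = lead < 0xc0 ? 0 :      # ASCII, or a stray continuation byte
            lead < 0xe0 ? 1 :
            lead < 0xf0 ? 2 : 3
    j = i + 1
    while ncont > 0 && j <= n && 0x80 <= b[j] < 0xc0
        j += 1
        ncont -= 1
    end
    return j
end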

Assuming validity of UTF-8 data doesn’t end up buying you much additional performance – if any. There are some more iterated index arithmetic functions I’ll get to writing optimized versions of after the 0.7 feature freeze. I’ll also do some more benchmarking against the previous UTF-8 code.

This was a big issue. Validating incoming data requires looking at all of it, which is not acceptable for large enough text data. And the validation was extremely spotty – some ways of getting strings would error if the data was invalid, while with other ways you’d end up with a string holding invalid data and no error. So we were paying the price for validation, and not even getting any validity guarantee from it. Moreover, as I’ve said, the assertion that you can decode UTF-8 much faster by assuming it is valid seems not to be correct, and would need to be backed up by some actual benchmarks to that effect (e.g. an implementation decoding UTF-8 assuming validity that is faster than my implementation above, which doesn’t).

Using a hybrid encoding like Python 3’s strings or @ScottPJones’s UniStr means that not only do you need to look at every byte of incoming data, but you also have to transcode it in general. This is a total performance nightmare for dealing with large text files. This is also the reason why his benchmarks are extremely misleading: he’s comparing operations that are O(n) for variable-width encodings like UTF-8 but O(1) for fixed-width encodings like UTF-32. But how did you get that fixed-width encoded string data in the first place? You aren’t getting data in UniStr form – since that’s not an actual encoding that exists in the wild. So you had to scan each incoming string to find its largest code point value, and then transcode it to the appropriate choice of Latin-1, UCS-2 or UTF-32. After all of that work, sure, indexing and counting code points are O(1), but you already did the work that you’re timing the UTF-8 string type doing.
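To make that concrete, here is a minimal sketch of the shape of the work such ingestion has to do (an assumption for illustration, not the actual UniStr code, and it assumes the input is already valid):

# Decode every code point up front, then pick the narrowest fixed-width storage.
function to_hybrid(s::String)
    cps = UInt32[UInt32(c) for c in s]             # O(n) scan and decode of all input
    maxcp = isempty(cps) ? 0x00000000 : maximum(cps)
    if maxcp < 0x100                               # fits in Latin-1 (ASCII included)
        return UInt8.(cps)
    elseif maxcp < 0x10000                         # fits in UCS-2
        return UInt16.(cps)
    else                                           # needs full UTF-32
        return cps
    end
end

Only after that up-front decode and copy do the O(1) indexing and counting benefits apply.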

Note also that in UniStr, if a large string that is mostly ASCII has a single emoji in it, then it needs to be stored in UTF-32, so it will be 4x larger than it would be in UTF-8, for example. That’s an extreme example, but also not all that contrived – mostly-ASCII data with a few emoji is not exactly an unlikely scenario.

Are there use cases for the UniStr kind of hybrid encoding? Sure. If you want to ingest a bunch of string data once and the strings are going to each be fairly small (limiting the potential effect of a single emoji), then it might be a good way to represent strings. But that’s a fairly specific scenario and hardly a typical one for data processing.

5 Likes

Wouldn’t not doing that now, and changing in 1.x to validation with a possible exception, be a breaking change?

I think we should default to validation on input, and we could relax this later if we want, e.g. with a CLI switch.

I’m conflicted here: would we want garbage in, and “same content” (or exactly the same, say when implementing cat) garbage out, instead of the exception?

The validation, and thus the possible exception, could be deferred as an option. Implementing that option need not be done before 1.0. Just having the option to skip validation entirely should be easy to implement right away, but can also wait for 1.x.

Seems bad, but please don’t add an (explicit, not a subtype) ASCII type, or any legacy 1-byte encoding, to Base. It might be too tempting for people to use, locking your code out of Unicode. Rather, have UTF-8, with ASCII as a subset, included as the go-to default.

Some:

[implementation] details below (that can wait for 1.x) for @ScottPJones:

That seems likely, but I can think of cases where at least some strings that are read in are never used again…

And often string handling is trivial, i.e. no “processing”, only output later. If we want to at least validate on output, we might as well do it right away on input, yes. Or do we want to do it lazily (the “deferred” option above) and mark strings as validated? Or maybe even have the string keep track of how many bytes have been validated?

I can think of a way to intentionally allocate 2x RAM for strings as a performance optimization…

I’m not convinced it is, or that it needs to be. When it’s all ASCII, branch prediction is pretty good. If you occasionally have one extra byte following, I can see your point, but you would have to be doing some real processing, not just input and then later output. And then it would be I/O bound, not CPU bound, anyway.

With East Asian texts you might have this problem, but you would still have it (once) if your input is in UTF-8 and you need to migrate to UCS-2?

I think I have a way to make everyone happy; can I post a proposal issue on the GitHub repo for your package? I’m not sure it fits best in Julia’s GitHub, at least yet. If I (or anyone) got around to implementing my novel (or not?) idea (e.g. O(1) indexing on multibyte “mutable” strings), then it would be more convincing.

I’m not sure that should be the key benchmark; rather, it should be one with some real processing.


I’ll look it up; I’m not sure Char was ever good enough in Julia, not having adopted Swift’s solution. Your Medium link was interesting, with:

“Swift’s Character type is designed to represent that “human-readable character” concept and would treat those final two code points as a single character, which also makes it more attractive to use.”

Not following.

I’m not sure what “respect Unicode” means. It can represent any Unicode code point, which ought to be good enough. The fact that it can represent other things too doesn’t do any harm.

Yes, as I understand it, Swift’s Character type represents a grapheme cluster. That’s an interesting choice, but I don’t think it’s mandated by Unicode in any way; code points are still a reasonable unit for processing text.

We have thought about it a lot, and I really think getting an exception here would be very annoying and unhelpful. Handling the exception would be awkward, so you’d actually need to arrange to avoid it in the first place, e.g. reading everything as a RawByteStr. But, that would require an extra step before anything useful could be done with the data. Alternatively, we could return either a UTF8Str or a RawByteStr based on the validation check. That would be a reasonable design, but there is a significant performance cost to switching between types. Maybe with the union type optimizations we have now it’s worth revisiting.

But really, what Base.String is designed for is “mostly UTF-8” data. If every UTF-8 file in the world were completely valid, we’d just have UTF8Str. Or, if you have data that’s nowhere near UTF-8, you need to do something else entirely. The first scenario obviously isn’t realistic. The second scenario we explicitly punt to packages. That leaves us with needing to handle e.g. UTF-8 files with a bit of Latin-1 or random junk mixed in. It’s not helpful to slap the programmer on the wrist because a line of UTF-8 had one Latin-1 character. Hopefully this explains some of the thinking behind Base.String.

3 Likes

Yes, that would be a breaking change. But we already tried this for two years and it had really poor usability. The trouble is that you want to read data in and then be able to check if it’s valid or not. If String isn’t allowed to hold invalid data, then you have to do everything with byte vectors first and only deal with strings once you’re sure you’ve got clean data. This leads to a perverse situation where packages that do heavy string processing like CSV and TextParse have to avoid using the String type in order to be robust.

In the new design, you can work with Strings that wrap invalid data no problem. If you want to know if a string or character is valid, just call Unicode.isvalid on it. If you want to replace invalid characters, just do this:

Unicode.isvalid(c) || (c = '\ufffd')

If you want to ignore invalid characters, just do

Unicode.isvalid(c) || continue

If you want to raise an error… you get the idea. It’s simple, easy to understand, and doesn’t impose a validity or transcoding tax on you unless you want it. There are some operations that do throw errors for invalid data. If you try to convert an invalid character to a code point value, for example, that’s an error – since there’s no well-defined answer.
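Putting those pieces together, a minimal sketch of a scrubbing pass (illustrative; substitute continue or an explicit throw for whatever policy you want):

# Copy a possibly-invalid string, replacing malformed characters with U+FFFD.
function scrub(s::AbstractString)
    io = IOBuffer()
    for c in s
        isvalid(c) || (c = '\ufffd')    # Base.isvalid, spelled Unicode.isvalid above
        print(io, c)
    end
    return String(take!(io))
end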

I’m conflicted here: would we want garbage in, and “same content” (or exactly the same, say when implementing cat) garbage out, instead of the exception?

Like I said, we tried it @ScottPJones’s way for two years and it has truly awful ergonomics for real-world data processing – this is not hypothetical or just my opinion, the data ecosystem has been struggling with it badly. Fundamentally, you need a string type that lets you choose whether to throw an error or not. You can wrap that in a stricter string type that always validates, but you can’t implement the relaxed version in terms of the strict one. So Base Julia gives you the more general, non-strict version and leaves strict Unicode enforcement to packages.

Seems bad, but please don’t add an (explicit, not a subtype) ASCII type, or any legacy 1-byte encoding, to Base. It might be too tempting for people to use, locking your code out of Unicode. Rather, have UTF-8, with ASCII as a subset, included as the go-to default.

Don’t worry, we’ve been there and it was bad. We’re not going to do that again. We might have an ASCII module with ASCII-only versions of things like case transformation. If you only want to do a simple ASCII uppercase transform, you could then call ASCII.uppercase(s) and it would uppercase ASCII characters and leave others alone, which is a very fast, simple operation in many encodings.
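As an illustrative sketch of what that hypothetical ASCII.uppercase might do (the module and function are assumptions based on the paragraph above, not existing code):

module ASCII

# Uppercase only the bytes 'a'-'z' and leave everything else untouched,
# so multi-byte characters pass through unchanged.
function uppercase(s::AbstractString)
    bytes = Vector{UInt8}(codeunits(s))
    @inbounds for i in eachindex(bytes)
        if UInt8('a') <= bytes[i] <= UInt8('z')
            bytes[i] -= 0x20
        end
    end
    return String(bytes)
end

end # module

ASCII.uppercase("José") would then give "JOSé", touching only the ASCII letters.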

The benefit of paying that penalty once, instead of every time a character is accessed, usually far outweighs the cost of doing the validation.

That seems likely, but I can think of cases where at least some strings that are read in are never used again…

On the contrary: while this premise and point of view make lots of sense in @ScottPJones’s line of work – building textual databases, where data is loaded once and queried many times – this is actually quite atypical. Most of the time, a program scans through a text file, computes derived values, and then throws the text away. In such use cases, being forced to do lots of work up front, only to never look at those strings again, is a total waste of time and effort.

As a general rule of thumb for high-performance computing, you don’t want to do work speculatively, you want to be as lazy as possible. @ScottPJones is trying to impose a very specific world view on the entire Julia language and ecosystem: he happens to work in an area where it does make sense to do more work pre-processing your text data so that accessing it later is faster. That’s fine and it can definitely be supported with packages, but it is not the norm. There is also an asymmetry here: if the default behavior is not to do unnecessary work, you can always opt into doing more work. But if you’ve already done the work because that’s the fundamental built-in behavior, you’re out of luck – you can’t undo work that’s already been done.

18 Likes

I tried to follow this thread, and there seem to be two issues raised by Scott:

  • Why is the module in the stdlib called Unicode, although it contains functions that may operate on other strings?
    => Maybe he is right here and this module should be called String, especially if not all Unicode strings are supported.

  • He criticized that modules in Base or the stdlib are somewhat more “first class” and are loaded faster.
    => This might be the case, but there is common agreement that this is the route forward. Currently it requires building a custom system image to end up at the same speed, but we all hope that it will become simpler in the future to actually get binary precompiled packages. It’s pretty unlikely that packages will be moved to Base just to make them load faster.

2 Likes

Actually, that isn’t so much of a problem, because it just affects load time. I was actually talking about not being able to get the same performance as String, because String has special hooks into the C code and avoids having an extra object and following a pointer indirection on every access, which means that it is not really apples to apples when comparing the performance of the strings in my package to the ones with special help.

Code in stdlib without any special C code support would have exactly the same issue.

That is a fair point. But couldn’t this be solved with a minimal hook in Julia core? It would not be the first hook for an external package that I have seen (e.g. SnoopCompile.jl uses a hook).

Please stop totally misrepresenting things. Many people here may not be aware of the facts of the situation (for which I have ample evidence).

I didn’t change anything AT ALL in the way strings were handled before I started contributing to Julia back in April 2015, except for fixing (some of) the many bugs and greatly improving the performance of conversions.

In v0.3.x, you had ASCIIString, UTF8String, UTF16String, and UTF32String.
See the following definition: https://github.com/JuliaLang/julia/blob/release-0.3/base/utf8.jl#L163, i.e.

convert(::Type{UTF8String}, a::Array{Uint8,1}) = is_valid_utf8(a) ? UTF8String(a) : error("invalid UTF-8 sequence")

The philosophy then was that if you converted something to a UTF8String, it was checked for validity.
I did not change that one bit.
I did fix bugs, such as #10919 (my very first Julia PR), and also found a very serious problem in #10958, within my first few weeks after I first saw Julia.

@stevengj said at the time, about #10958:

Whether we should accept (and silently convert) modified UTF-8 to standard UTF-8 is a separate issue; I tend to agree, but let’s keep that out of this discussion. After reading the RFCs, I agree that we shouldn’t produce the overlong NUL encoding ourselves

which Jeff also agreed with.

Also: Steven brought up the following back then, which may still be a problem:

Some of the functions in utf8.c seem to assume valid UTF-8, which may not be produced e.g. by bytestring(ptr, len).

Other string related things I fixed that were included in the v0.4 release:

and added a lot of unit tests (char and string functions had been very poorly covered previously):

Many record formats are defined with fields of fixed numbers of characters, padded with ' ' or 0s.
Also, for certain types of text analysis, you want to look at the distance in characters between two words or concepts in a sentence. Having one type of 'i' take 1 byte and another 3 bytes totally screws up those sorts of calculations, and it’s not worth the performance hit there to make it a little more accurate by first doing decomposition/composition and calculating using grapheme clusters (something I had to be convinced of myself by an expert on computational linguistics; I thought using graphemes would be necessary, but Unicode code points are fine).
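For illustration (the strings and the particular distance measure are only assumptions about the kind of calculation meant):

s = "naïve résumé text"
i = first(findfirst("naïve", s))
j = first(findfirst("résumé", s))
j - i                  # 7: the byte distance, skewed by the two-byte 'ï'
length(s, i, j) - 1    # 6: the distance in code points between the two word starts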

1 Like

Wow, I don’t quite understand the subject, but that is quite a big increase in machine code!

Why is this message hidden, mods?

2 Likes

@ScottPJones I get your passion for strings, I really get it. I get your passion for text processing performance. But I don’t see it as something that needs to be in Base. In fact, I think a lot of stuff there doesn’t need to be there.

For strings, there has to be some kind of string thing in Base. For most people, strings are just something you make to print a message. Some super simple transformations and indexing are required. Fully performant text processing doesn’t seem to be in that purview, and that’s fine. As a set of packages, you can build a type for every encoding, make it super performant, benchmark it to death, and show it should be the choice for text processing. Great!

The only issue there is the performance drop from not having a hook into what makes strings fast. So what about just focusing on that? Nothing else to do with Base strings, since those have a different focus. Just focus on answering the question of what’s required to get efficient strings outside of Base, and then use that to build your string empire.

I’m not sure you and @stefan will ever agree on everything related to strings, but that’s fine. The best way to show that you know strings is to build a bunch of successful string packages. Set the bad blood aside and find a way to move on. I probably wouldn’t use it because Base strings do println("this string") well enough for me, but I agree in some larger sense “it would be nice to have, but it’s not for me”.

7 Likes