Problems with deprecations of islower, lowercase, isupper, uppercase

While that was a concern initially, it didn't turn out to be true in the end. I did have to write some very tricky low-level implementations of functions like this:
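As a rough, illustrative sketch of the kind of byte-level case analysis such a function involves – the `nextind_utf8` name and byte-vector signature here are hypothetical, not the actual Base implementation:

```julia
# Illustrative sketch only: advance from byte position i to the start of the
# next character in UTF-8 data, without assuming the data is valid.
# Malformed bytes are simply treated as one-byte characters instead of throwing.
function nextind_utf8(bytes::Vector{UInt8}, i::Int)
    n = length(bytes)
    i >= n && return n + 1
    b = bytes[i]
    # ASCII byte or a stray continuation byte: advance a single byte.
    b < 0xc0 && return i + 1
    # Leading byte: expected sequence length from the high bits
    # (0xf8-0xff can never start a valid sequence).
    len = b < 0xe0 ? 2 : b < 0xf0 ? 3 : b < 0xf8 ? 4 : 1
    len == 1 && return i + 1
    j = i + 1
    # Consume only bytes that really are continuations (0b10xxxxxx),
    # stopping early on truncated or malformed sequences.
    while j < i + len && j <= n && (bytes[j] & 0xc0) == 0x80
        j += 1
    end
    return j
end
```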

Assuming validity of UTF-8 data doesn't end up buying you much additional performance – if any. There are a few more iterated index-arithmetic functions that I'll write optimized versions of after the 0.7 feature freeze. I'll also do some more benchmarking against the previous UTF-8 code.

This was a big issue. Validating incoming data requires looking at all of it, which is not acceptable for sufficiently large text data. And the validation was extremely spotty – some ways of getting strings would error if the data was invalid, while with others you'd end up with a string holding invalid data and no error. So we were paying the price for validation and not even getting any validity guarantee from it. Moreover, as I've said, the assertion that you can decode UTF-8 much faster by assuming it is valid seems not to be correct, and it would need to be backed up by some actual benchmarks to that effect (e.g. an implementation that decodes UTF-8 assuming validity and is faster than my implementation above, which doesn't).
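For contrast, here is a minimal sketch of the non-validating design, assuming Julia 0.7+/1.x behavior (the byte values are just an example):

```julia
# String construction doesn't scan or reject invalid data; validity is a
# separate, explicit O(n) check that you only pay for when you need it.
bytes = [0x66, 0xff, 0x6f]   # 'f', an invalid byte, 'o'
s = String(bytes)            # no validation pass, no error
isvalid(s)                   # false -- explicit check, done only on request
collect(s)                   # iteration still works; the bad byte comes
                             # through as a malformed Char ('\xff')
```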

Using a hybrid encoding like Python 3's strings or @ScottPJones's UniStr means that not only do you need to look at every byte of incoming data, but in general you also have to transcode it. This is a total performance nightmare for dealing with large text files. It's also the reason his benchmarks are extremely misleading: he's comparing operations that are O(n) for variable-width encodings like UTF-8 but O(1) for fixed-width encodings like UTF-32. But how did you get that fixed-width string data in the first place? You aren't getting data in UniStr form, since that's not an actual encoding that exists in the wild. So you had to scan each incoming string to find its largest code point value and then transcode it to the appropriate choice of Latin-1, UCS-2 or UTF-32. After all of that work, sure, indexing and counting code points are O(1), but you've already done, up front, the very work that the benchmark charges against the UTF-8 string type.
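To make that ingestion cost concrete, here's a rough sketch of the scan-and-transcode work any Python-3-style hybrid scheme has to do before it can offer O(1) indexing. The `ingest_hybrid` function and its return representation are hypothetical, not UniStr's actual API:

```julia
# Hypothetical sketch of hybrid-encoding ingestion: scan every character to
# find the widest code point, then transcode the whole string into the
# narrowest fixed-width encoding that fits.
function ingest_hybrid(utf8::String)     # assumes the input is valid UTF-8
    maxcp = UInt32(0)
    for c in utf8                        # first O(n) pass over the data
        maxcp = max(maxcp, UInt32(c))
    end
    # Second O(n) pass: transcode into Latin-1, UCS-2, or UTF-32.
    if maxcp <= 0xff
        return [UInt8(c) for c in utf8]
    elseif maxcp <= 0xffff
        return [UInt16(c) for c in utf8]
    else
        return [UInt32(c) for c in utf8]
    end
end
```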

Note also that with UniStr, if a large, mostly-ASCII string has a single emoji in it, it needs to be stored as UTF-32, so it will be 4x larger than it would be in UTF-8. That's an extreme example, but also not all that contrived – mostly-ASCII data with a few emoji is not exactly an unlikely scenario.
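A quick back-of-the-envelope check of that 4x figure (the sizes here are illustrative):

```julia
# Mostly-ASCII text with a single emoji.
s = "x"^1_000_000 * "😃"   # a million ASCII characters plus one emoji
sizeof(s)                   # UTF-8: 1_000_004 bytes (the emoji takes 4 bytes)
4 * length(s)               # as UTF-32: 4_000_004 bytes -- about 4x larger
```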

Are there use cases for the UniStr kind of hybrid encoding? Sure. If you want to ingest a bunch of string data once and each string is going to be fairly small (limiting the potential effect of a single emoji), then it might be a good way to represent strings. But that's a fairly specific scenario and hardly a typical one for data processing.
