String indices : byte indexing feels wrong

I am discovering Julia language (from C++, Java and Python background), and I really love several concepts of Julia, especially how Julia handles types.

But I consider the choice to “almost disallow” usage of indices with the default String type is wrong. I mean, the String “\u20ac\U168E0\u07F7” is of length 3, but to get the second character, you have to write s[4] and for the third character s[8]. s[2] for example, throws an exception. Terrible !

function test()
    s = "\u20ac\U168E0\u07F7"
    println(length(s))
    println(s[1])
    println(s[4])
    println(s[8])
end
test()

Sorry for basic question, I am at beginning of learning curve :

I suppose I can use another string type inside my own packages, that will handle indices the right way ?

But, a soon as I have to pass or receive strings from/to other packages, I see no good way to avoid the problem. Either, all strings will have to be converted on boundaries (performance issue). If everybody is using AbstractString, the issue remains because no one will ever know if the AbstractString that is passed is using the weird or a straight indexing. I end up thinking that, in the current situation, usage of string indices must be abandonned in Julia (except in the particular case where strings are “package private” usage).

I just discovered this concern no more than 30 minutes ago and I am right now in the process of brainstorming around this discovery. I am still very confused. Help appreciated …

1 Like

Take a look at the functions nextind, prevind, thisind.

And perhaps this blog post is useful to you: Dealing with strings in Julia, patterns and anti-patterns - Julia Community 🟣

Do you know the docs?
An excellent starter on strings is here:
https://docs.julialang.org/en/v1.9/manual/strings/

This has been considered carefully and discussed many times. See e.g. Substring function? - #27 by stevengj

For a more extensive discussion, see e.g. http://utf8everywhere.org/ … there are strong reasons to prefer the variable-width UTF-8 encoding for general-purpose string handling, and many modern languages have made the same choice (e.g. Swift, Go). It trades off efficiency on an operation you hardly ever need (“give me the n-th character”) for many other advantages.

But it is jarring in the first few days of using strings — the experience for new users is the most unfortunate tradeoff of a variable-width encoding! If there is a particular practical operation you are not sure how to do on Julia strings (e.g. looping over strings, searching strings, extracting slices of strings based on patterns, etc. are all easy once you get used to it), please feel free to ask (after searching the manual/web).

19 Likes

By the way, just thought I’d chime in with another option you might find useful: simply collecting the string to get a vector of characters:

julia> s = "\u20ac\U168E0\u07F7"
"€𖣠߷"

julia> v = collect(s)
3-element Vector{Char}:
 '€': Unicode U+20AC (category Sc: Symbol, currency)
 '𖣠': Unicode U+168E0 (category Lo: Letter, other)
 '߷': Unicode U+07F7 (category Po: Punctuation, other)

julia> v[1]
'€': Unicode U+20AC (category Sc: Symbol, currency)

julia> v[2]
'𖣠': Unicode U+168E0 (category Lo: Letter, other)

julia> v[3]
'߷': Unicode U+07F7 (category Po: Punctuation, other)

This is of course rather inefficient, but if you don’t care about performance it can be quite convenient.

7 Likes

Try

s = "\u20ac\U168E0\u07F7"
idx = collect(eachindex(s))
length(idx) # = 3
s[idx[1]]
s[idx[2]]
s[idx[3]]
3 Likes

@Mason
Thanks for this example: that is exactly what I was thinking. I think this way of handling strings by copying them in vectors will become a common practice in Julia.

@stevengj
There are a lot of very efficient algorithms developed in computer science research that deals with so called “strings”. They all consider “strings” to be directly indexable. For example, algorithms for efficient suffix arrays computation (Ge Nong et al. 2009: Linear suffix array construction by almost pure induced sorting, Data Compression Conference). It is just one example, you have thousands of such articles …

It is hard for me to anticipate whether adapting these algorithms to Julia Strings, using nextind, prevind, thisind, will be an easy or a hard task for a proficient Julia developer. It is also hard for me to ancicipate whether it will lead to good performances or not.

I think there are high probability developers will just use Vector to implement these kind of algorithms in Julia.

Here again, I can’t really anticipate what can be the actual performance impact of copying and of using Vectors. Of course, it may depend on the context.

So, I need to learn more and dig in the details, make experiments, …, before I can have a strong opinion on this topic.

Welcome to the Julia community @hlbnet. Since you are new here, consider searching existing packages for string (or text) manipulation. There is an organization where you may find packages that implement additional functionality not present in a built-in Julia installation:

Try to always search juliahub.com for packages before starting a new effort by yourself.

3 Likes

In UTF-8, you typically apply such algorithms to the code units (bytes) of the string, given by codeunits(s) in Julia, which are directly/consecutively indexable and act as ordinary vectors.

You don’t need to operate on Unicode codepoints (“characters”) for such calculations, especially since codepoints probably aren’t what you think. For example, "noël" has 5 characters, and "🏳️‍🌈" has 4 characters!

This was also explained in the utf8everywhere.org page that I linked above, whose introduction wrote:

Furthermore, we would like to suggest that counting or otherwise iterating over Unicode code points should not be seen as a particularly important task in text processing scenarios. Many developers mistakenly see code points as a kind of a successor to ASCII characters. This lead to software design decisions such as Python’s string O(1) code point access. The truth, however, is that Unicode is inherently more complicated and there is no universal definition of such thing as Unicode character. We see no particular reason to favor Unicode code points over Unicode grapheme clusters, code units or perhaps even words in a language for that. On the other hand, seeing UTF-8 code units (bytes) as a basic unit of text seems particularly useful for many tasks, such as parsing commonly used textual data formats. This is due to a particular feature of this encoding.

So, I wouldn’t go into this discussion assuming that there’s a glaring problem with the UTF-8 encoding that somehow hasn’t occurred to anyone until now. Most languages use variable-width encodings these days (including Java and C++), and UTF-8 has become a dominant encoding for text. Really, it’s possible to do things efficiently, but you just have to adjust your thinking about what is being indexed!

12 Likes

It gets even worse: the noel you wrote has 5, but when I copy-paste it into this text field, the "noël" I see here has 4 characters! Looks like we get lucky and Discourse doesn’t apply any normalization (but looks like my browser’s text entry box does), so you can even copy-paste the two strings into Julia to see for yourself.

4 Likes

I’m quite sympathetic to OP sentiment, and I think Discourse and other Julia forums should pay special attention to new Julians, as they represent the least biased ideas about barriers to adoption.

Aside, here is a quick-and-dirty way to access strings using codepoint indices:

using MappedArrays

codepoint_index(s) = isempty(s) ? error("Empty string not allowed") : let lastcpidx = 1, 
  lastbyte = 1, lastcp = s[1]
    mappedarray(i->begin
        if lastcpidx == i return lastcp; end
        if lastcpidx < i res = nextind(s, lastbyte, i-lastcpidx) ; end
        if lastcpidx > i res = nextind(s, 1, i-1) ; end
        lastcpidx, lastbyte, lastcp = i, res, s[res]
        return lastcp
    end, 1:length(s))
end

and with this:

julia> trickystr = "noël"
"noël"

julia> ts = codepoint_index(trickystr)
5-element mappedarray(var"#17#18"{String}("noël", Core.Box(1), Core.Box(1), Core.Box('n')), ::UnitRange{Int64}) with eltype Any:
 'n': ASCII/Unicode U+006E (category Ll: Letter, lowercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 '̈': Unicode U+0308 (category Mn: Mark, nonspacing)
 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)

julia> ts[5]
'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)

julia> ts[3]
'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)

If you want that and aren’t willing to do s[nextind(s, 0, i)], you might as well just do collect(s) to get a Vector{Char}. With anything more complicated than that, it’s better to just get used to UTF-8.

The big downside of UTF-8 is that it exposes you to some of the complexity of Unicode within the first five minutes of playing with strings, which I’m sympathetic to for new users.

On the other hand, with an encoding like UTF-16 (e.g. used in Java and Javascript, or in C/C++ for wchar_t and std::wstring on Windows), this just makes your bugs harder to find — if you work mainly with Western languages, most programmers don’t realize that they aren’t indexing characters until they (or more likely, their users) hit more unusual characters. e.g. "🐨" is a single Unicode codepoint (unlike the other examples above) but is two UTF-16 code units.

4 Likes

I consider that Java is wrong with its handling of unicode. Not to say you can’t handle unicode properly in Java, but it is beyond the capability of an average Java developer, that do not understand how wrong his code is. So, to my opinion, there is no point comparing Julia to Java in this topic (nor with usage of wchar_t or std::wstring in C++).

I think the good comparison is with Python.
Here is a piece of Python code:

s = "\u20ac\U000168E0\u07F7"
print(len(s))
print(ord(s[0]))
print(ord(s[1]))
print(ord(s[2]))

Prints this:

3
8364
92384
2039

Yeah, Python 3 makes a different choice that preserves that direct character indexing — it’ll dynamically transcode strings to whatever fixed-width representation is required behind the scenes for you in order to support it. But it’s also not without tradeoffs, both in its performance and semantics. I’ve struggled with UnicodeDecodeErrors and str-vs-bytes type errors in Python myself — and those are direct consequences of this behind-the-scenes transcoding that happens.

Direct character indexing is one of those things that you want when you first start looking at strings, and so it does hit newcomers hardest. But when you go to programmatically work with strings, you’re almost never hand-writing those indices — you’re using indices you got as output from findfirst or a regex match or iterating or the like… or you’re counting character display widths or grapheme clusters or…

The one common exception in my experience is the +/-1 operation — that’s where it is annoying to need to remember to use nextind and prevind. But even there, I have found I need to think carefully about whether I want the previous character or grapheme or whatnot.

3 Likes

The way Python3 handles this internally is that every string is transcoded to a fixed-width format based on the largest code point in the string. (Specifically one of Latin-1, UCS-2 or UTF-32—yes, the names of these are comically all over the place, but these are the three fixed-width encodings that represent Unicode code points directly as UInt8, UInt16 and UInt32 values). What is the problem with this? Here are the major issues:

  1. The Python interpreter has to read and decode every string before the user can do anything with it. For large data, you might not actually need to look at most of some data at all, so this can be a potential performance problem. E.g. you might want to mmap an entire large file as a string and then access only parts of it that you need without even looking at most of it. In Julia you can do this whereas in Python3 you cannot because the system will read the entire thing, and worse still, transcode the whole thing if there’s anything non-ASCII in there…

  2. Any UTF-8 string that contains non-ASCII data needs to transcoded at least once. Not only do you have to look at all string data, but if any of it isn’t ASCII, then you’ll need to expand the string into an up to 4x larger buffer than the original string. If you later print that string, you’ll need to transcode it back to UTF-8 in order to send the string anywhere external.

  3. As alluded to above, the blow up to represent a string can be up to 4x. How common is this? I’ve heard arguments made that high code points are very rare, which is mostly true, with the exception of one rather popular kind of character: emojii. Code points for emoji start at U+1F600, which is too big to fit in a UInt16. And of course, they’re constantly sprinkled into text that would otherwise be pure ASCII, which is basically the worst possible case for Python3: any otherwise-ASCII-text that contains even a single emojii ends up using four bytes per character, inflating it by 4x. Not ideal given how common emojii are in real world text data these days.

  4. Since Python3 represents strings as vectors of code point values, it cannot, by design, handle invalid string data. Is that a problem? In a perfect world, string data would all be valid and this would be fine, but unfortunately, that’s not the world we live in. CSV files regularly include invalid UTF-8 in fields even if the file is otherwise encoded as UTF-8. When you’re writing a server, you really don’t want your server to just die when someone sends a poorly formated request. File systems will happily give you invalid string data for path names—yes, this is real and it sucks, but both UNIX and Windows do this. How does Python let you handle this? It’s a mix between forcing you to work with byte arrays instead of strings and just not having any way to handle it. It’s extremely common for long-running data analysis processes in Python to just suddenly die deep in some work because they encountered a stray \xff byte. Any operation that gets external data and produces strings is susceptible.

A common defense of this behavior is to say that strings shouldn’t be allowed to be invalid in the first place—picture a strict Victorian school teacher wagging her finger at you. But amusingly, Python3 isn’t even disciplined about that since they do allow strings to be as invalid in some ways. Specifically, unpaired surrogate code units are allowed in strings even though they are invalid. You see, it’s only some kinds of invalid string data that are morally reprehensible and too dangerous to allow. Conveniently, the morally acceptable invalid strings are exactly the ones that Python happens to be able to represent and the morally repugnant ones are the kind that it can’t represent.

Anyway, why the long rant? Because it’s very annoying having people hold up Python3’s string handling as some paragon of excellence. Literally the only thing it has going for it is that once you have a string—you’ve already paid an O(n) scan and transcode cost—you can do O(1) indexing by character. Never mind that indexing by character isn’t particularly useful and almost never what you actually need: for efficiency, code units are better; for human percieved behavior, normalized grapheme clusters are what you should work with. Never mind that even if you don’t want or need to do O(1) indexing by character, you still are forced to pay the O(n) up front cost everywhere.

12 Likes

Yes, these are good technical arguments.

But I think the concern is more about adoption, and I am worried the current String implementation may be a barrier to adoption of Julia, knowing string are everywhere and the average skills of everyday developers.

I am thinking that it may be possible to keep the String implementation as-is, while adding some kind of wrapper, of optional use, that could provide a simpler indexing API on top of a basic UTF8 string. The main issue being to manage smooth transition between the two API (maybe, basic string could store a pointer to its wrapper if already computed, to prevent from creating it again and again).

Thank you all for your explanations. I have read much of the very interesting references you provided. It is much appreciated. Be sure I learn from all your answers.

3 Likes

In case it’s useful, IterTools.jl provides nth() which allows extracting nth(s,1), nth(s,2), etc.

To achieve the same result as the Python example above:

using IterTools
s = "\u20ac\U000168E0\u07F7"
n = length(s)
foreach(i -> println(Int(nth(s,i))), 1:n)
1 Like

To demonstrate codeunits, here’s an example.

julia> function test()
           s = "\u20ac\U168E0\u07F7"
           c = codeunits(s)
           println(length(c))
           println(c[1])
           println(c[4])
           println(c[8])
           display(c)
       end
test (generic function with 1 method)

julia> test()
9
226
240
223
9-element Base.CodeUnits{UInt8, String}:
 0xe2
 0x82
 0xac
 0xf0
 0x96
 0xa3
 0xa0
 0xdf

Another point I want to make which I’m not sure if it is clearly made above is that you can iterate through characters in a String easily via a for loop.

julia> for c in s
           println(c)
       end
€
𖣠
߷

julia> eachindex(s)
Base.EachStringIndex{String}("€𖣠 ߷")

julia> for i in eachindex(s)
           println(i, ": ", s[i])
       end
1: €
4: 𖣠
8: ߷

Some of your question implies that you mainly want to address interface issues. Sometimes we overthink deeper issues, when people just want to customize the interface. Let’s create a type that wraps around a Vector{Char} as demonstrated above. The main advantage of making a wrapper type like this is that you can write your own convert methods for it without commiting type priacy.

julia> begin
           struct EagerStringChars <: AbstractVector{Char}
               s::Vector{Char}
           end
           EagerStringChars(s::AbstractString) = EagerStringChars(collect(s))

           # Implement minimal array interface:
           # https://docs.julialang.org/en/v1/manual/interfaces/#man-interface-array
           Base.IndexStyle(::Type{EagerStringChars}) = IndexLinear()
           Base.getindex(esc::EagerStringChars, i) = esc.s[i]
           Base.size(esc::EagerStringChars) = size(esc.s)

           Base.convert(::Type{String}, esc::EagerStringChars) = String(esc.s)
           Base.convert(::Type{EagerStringChars}, s::AbstractString) = EagerStringChars(s)
       end

julia> s = "\u20ac\U000168E0\u07F7"
"€𖣠 ߷"

julia> esc = EagerStringChars(s)
3-element EagerStringChars:
 '€': Unicode U+20AC (category Sc: Symbol, currency)
 '𖣠': Unicode U+168E0 (category Lo: Letter, other)
 '߷': Unicode U+07F7 (category Po: Punctuation, other)

julia> esc[1]
'€': Unicode U+20AC (category Sc: Symbol, currency)

julia> esc[2]
'𖣠 ': Unicode U+168E0 (category Lo: Letter, other)

julia> esc[3]
'߷': Unicode U+07F7 (category Po: Punctuation, other)

julia> strings = String[esc]
1-element Vector{String}:
 "€𖣠߷"

julia> push!(strings, esc)
2-element Vector{String}:
 "€𖣠߷"
 "€𖣠߷"

julia> string_chars = EagerStringChars[s]
1-element Vector{EagerStringChars}:
 ['€', '𖣠', '߷']

We can also create a lazy version of this. Each time we index a string, we will iterate through the string and return the nth character.

julia> begin
           struct LazyStringChars{S} <: AbstractVector{Char}
               s::S
           end

           Base.IndexStyle(::Type{LazyStringChars}) = IndexLinear()
           function Base.getindex(lsc::LazyStringChars, i)
               # consider IterTools.nth
               checkindex(Bool, 1:length(lsc.s), i) || throw(ArgumentError("Out of bounds index $i"))
               n = 0
               for c in lsc.s
                   n += 1
                   n == i && return c
               end
           end
           Base.size(lsc::LazyStringChars) = (length(lsc.s),)

           Base.convert(::Type{String}, lsc::LazyStringChars) = String(lsc.s)
           Base.convert(::Type{LazyStringChars}, s::AbstractString) = LazyStringChars(s)
       end

julia> lsc = LazyStringChars(s)
3-element LazyStringChars{String}:
 '€': Unicode U+20AC (category Sc: Symbol, currency)
 '𖣠': Unicode U+168E0 (category Lo: Letter, other)
 '߷': Unicode U+07F7 (category Po: Punctuation, other)

julia> lsc[1]
'€': Unicode U+20AC (category Sc: Symbol, currency)

julia> lsc[2]
'𖣠 ': Unicode U+168E0 (category Lo: Letter, other)

julia> lsc[3]
'߷': Unicode U+07F7 (category Po: Punctuation, other)

julia> push!(strings, lsc)
3-element Vector{String}:
 "€𖣠߷"
 "€𖣠߷"
 "€𖣠߷"

julia> lazy_string_chars = LazyStringChars[s]
1-element Vector{LazyStringChars}:
 ['€', '𖣠', '߷']
4 Likes