Indexing strings by Unicode code point instead of code unit?

I am very new to Julia and would like to find out if I want to use it for implementing a library.
For this, I need to deal with Unicode strings and especially offset ranges of substrings a lot. In this context, I would need/want to represent ranges of Unicode characters, i.e. ranges of Unicode code points. However, as I understand, the default indexing of strings in Julia accesses UTF-8 code units, not code points.

This is different from Python, where indexing actually accesses the code points.

My question is: is there some kind of library or alternate string implementation that already deals with Strings properly as sequences of Unicode code points, irrespective of the encoding used? I do not care what representation is used internally, I do care about a string/text being a sequence of code points (even if that may make some string operations a little bit slower or, more like, use more memory for mapping arrays or the like).

UTF-32 strings from GitHub - JuliaStrings/LegacyStrings.jl: Legacy Unicode string types should give you what you want. (Python stores strings in UTF32)

2 Likes

Interesting – do you happen to know if it is possible to not just convert between the UTF32 representation from this library and the default Julia strings back and forth, but more importantly also convert the offsets?

Actually, even more importantly: does this library actually literally convert the string representation? Because I would not actually want that – All I need is s “view” of a Julia string (no matter what the representation) which allows me to index Unicode codepoints.

My issue with the julia strings is not the representation - I do not care about it really, but the way how one indexes/retrieves the elemets of the string: the Julia design decision apparently was to use indexing to retrieve the elements of representation, not the actual elements a string has: characters.
But it should be possible to at least ADD a way to do this by providing a “view”.

So codepoint(juliastring, 3) or similar should give me the actual third codepoint, which could be the 5th code unit in Julia. That should be achievable by simply creating an index map from code point indices to code units that are the beginning of code points (and another one for the other way).

I guess I would be prepared to actually implement this though it looks like something so basic that it feels like it should exist :slight_smile:
(on the other hand it does not exist in Java either where strings are represented using UTF-16 code units with surrogate characters).

You can create this mapping with points = collect(eachindex(str)). This has a big cost of course so in most cases you’d avoid that and use eachindex directly…

2 Likes

Thanks, this actually looks exactly like what I need! You say it has a big cost, can you be more specific? My assumption would be that the cost (time and memory) is linear in the number of code units of the string, is that true? Is there anything else to be worried about (e.g. handling slow exceptions internally or other things that could put a high constant factor on that linear cost?)
If the cost is basically what going through all the code units of the string in some other way would be, that is fine, the same is needed to map offsets e.g. between Java UTF16 code units and Python unicode code points.

Where are you getting the ranges of Unicode characters from? In most kinds of string-processing operations (e.g. search/replace), you find ranges of characters by first iterating over the string, in which case you also get the indices (regardless of the indexing scheme).

It’s pretty rare to have an application in which someone says “give me the 527th character of this string” where the number 527 just falls down out of the sky. (This is a key reason why a variable-width encoding like UTF-8 can be so popular.)

When you iterate over a Julia string (for any AbstractString), you indeed get a sequence of Unicode codepoints, regardless of the encoding. If you loop over pairs(somestring) then you get both the codepoints and the corresponding indices. There are also nextind and prevind for more low-level iteration.

Note that you can also get the index of the n-th character in a string s via nextind(s, 0, n), although this is an O(n) operation in a variable-width encoding like UTF-8.

In the unlikely event that you really need a fixed-width encoding, you can use UTF32String from LegacyStrings.jl.

3 Likes

For the problem I want to solve it is actually the other way round: I do have a Unicode string (in some encoding originally, but ultimately it will be a “normal” Julia string) and a large list of offset ranges, where the offsets are 0-based Unicode code point indices.
Now in order to access the string subsequences that correspond to those offset ranges, I need to first convert these offset ranges into the proper offset ranges for Julia strings, and I think the suggestion of creating a map using collect(eachindex(str)) is probably the solution to this.

Thanks for pointing this out! I guess Julia’s compromise of treating strings somewhat like sparse arrays (I assume), is a good compromise between proper handling of Unicode and still being able to efficiently access code points.

Thank you @sudete and @stevengj I think my problem is solved and I have a rough idea of how this could work! Looking forward to learn Julia properly now :smiley:

Yes I just meant the cost of allocating the memory and processing the whole string to populate the mapping. It’s a large cost relative to looping directly on the string (so you process it only once and without allocation).

Assuming performance is a concern in your application, why not keep the original encoding? If it’s not UTF-8 you can probably maybe¹ use direct indexing in the original string without a mapping…

¹ I forgot about UTF-16 surrogates, and as mentioned by @stevengj UTF-16 is a common case of non-UTF-8 text. (I don’t know if there’s another likely source of text with non-constant codepoint size?)

Out of curiosity, where are these offset ranges coming from in your application?

(Beware that if they are coming from another program that uses the UTF-16 encoding, e.g. in JavaScript or Windows C++, they might be indices of UTF-16 code units, which are almost codepoint indices but differ for relatively rare cases—unfortunately this just makes bugs harder to find but no less severe. This came up for me as a Jupyter bug once, for example.)

2 Likes

These are representations of texts with stand-off annotations for natural language processing and can ultimately come from many different sources. I have already worked for a long time with these in both Java (where the offsets refer to UTF16 code units with surrogate pairs representing what cannot be represented in 16 bits) and Python (where string indices directly access Unicode code points).

If you have a text with a sequence of ranges that are ordered monotonically by codepoint, then you can convert them to UTF-8 (Julia String) indices (or whatever data structure is desired) by a single pass over the text and indices, without allocating an explicit index map. (In any case, I would recommend doing the conversion once when they are read in, then for subsequent processing use native indices.)

2 Likes

This is not correct. Julia string indices retrieve characters, not code units, it’s just that the indices are not consecutive:

julia> s = "αβγ"
"αβγ"

julia> collect(eachindex(s))
3-element Array{Int64,1}:
 1
 3
 5

julia> s[1]
'α': Unicode U+03B1 (category Ll: Letter, lowercase)

julia> s[3]
'β': Unicode U+03B2 (category Ll: Letter, lowercase)

julia> s[5]
'γ': Unicode U+03B3 (category Ll: Letter, lowercase)

(Note that the return value of s[i] is a character represented by Char, a 4-byte object corresponding to a Unicode codepoint.) String iteration is also over characters:

julia> for c in s
           display(c) # pretty-print c
       end
'α': Unicode U+03B1 (category Ll: Letter, lowercase)
'β': Unicode U+03B2 (category Ll: Letter, lowercase)
'γ': Unicode U+03B3 (category Ll: Letter, lowercase)

julia> for (i,c) in pairs(s)
           @show i, c, s[i]
       end
(i, c, s[i]) = (1, 'α', 'α')
(i, c, s[i]) = (3, 'β', 'β')
(i, c, s[i]) = (5, 'γ', 'γ')

In contrast, the code units (bytes for UTF-8) are retrieved by the codeunit function:

julia> codeunit(s, 3)
0xce

julia> codeunits(s)
6-element Base.CodeUnits{UInt8,String}:
 0xce
 0xb1
 0xce
 0xb2
 0xce
 0xb3

Substrings work similarly: you give them code-unit indices, but they still give you the whole string of Unicode characters:

julia> s[3:5]   # a copy
"βγ"

julia> SubString(s, 3:5)    # a view
"βγ"

So, the point is, once you convert your character offsets to Julia String indices, you are still working with characters.

6 Likes

@stevengj yes, sorry, I noticed this already from earlier explanations but did not correct myself, thanks for the clarification. I have learned that this is really more like a partial function or sparse array of actual proper Unicode characters.

It’s increasingly not rare, as I think you may know, e.g. all emojis use surrogates. Such as :slight_smile: not to be confused with emoticons like :), which if written that way and displayed that way is only ASCII (however many editors, like on Discourse change emoticons to emojis. So in practice e.g. otherwise English-only text, needs full variable length UTF-16 (as opposed to UCS-2). Such as SMS/text messaging. According to spec it uses UCS-2, but it’s known to use UTF-16 for that reason. Also non-BMP, i.e surrogates in UTF-16 used for “some modern scripts like Osage, Warang Citi, Adlam, Wancho and Toto.”