Unexpected index of Unicode subscript `char` in `string`?

frankwswang · June 24, 2021, 8:41pm

It seems that in some cases the indexing of subscript char is wrong:

julia> "a₁"[1]
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> "a₁"[2]
'₁': Unicode U+2081 (category No: Number, other)

julia> "α₁"[1]
'α': Unicode U+03B1 (category Ll: Letter, lowercase)

julia> "α₁"[2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'α', [3]=>'₁'

julia> length("α₁")
2

I’m using Julia 1.6.1 by the way.

StefanKarpinski · June 24, 2021, 8:48pm

Indexes are to code units (bytes) not characters. This is addressed extensively in the string section in the manual:

https://docs.julialang.org/en/v1/manual/strings/
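
As a minimal sketch of what byte-based indexing means for the string in the question (assuming the default UTF-8 `String` type), the valid indices can be listed and stepped through with `eachindex`, `nextind` and `isvalid` rather than by adding 1:

```julia
s = "α₁"

# Byte (code unit) indices at which a character starts.
valid = collect(eachindex(s))        # [1, 3]

# Step from one valid index to the next instead of adding 1.
i = firstindex(s)                    # 1
j = nextind(s, i)                    # 3

# isvalid reports whether an index is the start of a character.
isvalid(s, 2)                        # false

# length counts characters; ncodeunits counts bytes.
(length(s), ncodeunits(s))           # (2, 5)
```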

oxinabox · June 24, 2021, 9:08pm

There are many things you might want to index.

codeunit: basically the individual bytes
codepoint: a single character, may be multiple bytes in unicode
grapheme: a single display unit. For example, an emoji sequence can be 1 grapheme even though it is made up of 4 codepoints that, when placed next to each other (joined by zero-width joiners), display as one

Of all of these: only codeunit can be done in O(1) time.
The rest are O(n).

I don’t actually know how to index by codepoint, or grapheme.
For codepoint you can call collect first, which is OK, but not great.
For grapheme, I am not sure if it can be known without knowing things about the device displaying it.
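
For the codepoint case, one possible sketch that avoids allocating with `collect` is to let `nextind` walk `n` characters from index 0. The helper name `nthchar` is hypothetical, not something in Base, and the lookup is still O(n):

```julia
# Hypothetical helper: return the n-th character of s (O(n) scan, no allocation).
nthchar(s::AbstractString, n::Integer) = s[nextind(s, 0, n)]

s = "α₁"
nthchar(s, 1)            # 'α'
nthchar(s, 2)            # '₁'

# The collect approach gives the same answer but allocates a Vector{Char}.
collect(s)[2]            # '₁'
```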

frankwswang · June 24, 2021, 9:12pm

Now I understand: the number of code units in “α” is 2, not 1 as for ‘a’:

julia> codeunits("α")
2-element Base.CodeUnits{UInt8, String}:
 0xce
 0xb1

julia> codeunits("a")
1-element Base.CodeUnits{UInt8, String}:
 0x61

which causes this behavior. Thank you!

frankwswang · June 24, 2021, 9:36pm

This is only loosely related to the topic, but is there a particular reason why findfirst does not return all the indices a char occupies when it has multiple code units?

julia> findfirst("α", "α1")
1:1

The reason I ask this is that I want to efficiently locate a char inside a string no matter how many code units it contains. I know I can definitely do something like this though:

julia> i = findfirst("α", "α1")[1]; i : i + ncodeunits("α") - 1
1:2

Thank you!

tomerarnon · June 24, 2021, 10:04pm

Because then that range can’t be used to get the character:

julia> "α1"[1:2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'α', [3]=>'1'
...

julia> "α1"[1:1]
"α"

ericphanson · June 24, 2021, 11:08pm

There’s Unicode.graphemes to iterate over them, and I guess one could collect that to index them.
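
A small sketch of that idea, using the `Unicode` standard library and a string chosen here purely for illustration ("e" followed by a combining accent, then "a"):

```julia
using Unicode

# "e" followed by U+0301 (combining acute accent): two codepoints, one grapheme.
s = "e\u0301a"

gs = collect(graphemes(s))   # Vector of SubString{String}
length(s)                    # 3 codepoints
length(gs)                   # 2 graphemes
gs[2]                        # "a"
```

Collecting is O(n) and allocates, so this is more suitable for occasional lookups than for indexing inside a hot loop.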

frankwswang · June 25, 2021, 12:29am

If you want to get the location of the char, why not just search for it as a char, which gives you the starting index? Isn’t this the more appropriate usage?

julia> str = "α1"
"α1"

julia> i = findfirst('α', str)
1

julia> str[i]
'α': Unicode U+03B1 (category Ll: Letter, lowercase)
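
If the full byte extent of the character is really needed, it can be derived from that starting index. A minimal sketch, assuming a UTF-8 `String`:

```julia
str = "α1"
i = findfirst('α', str)              # 1

# Last byte of the character that starts at i.
lastbyte = nextind(str, i) - 1       # 2
byterange = i:lastbyte               # 1:2

# Equivalent, via the character's own encoded size.
byterange2 = i:(i + ncodeunits(str[i]) - 1)   # 1:2
```

Such a range describes bytes, so it is only meaningful for byte-level views like `codeunits(str)`; `str[byterange]` would throw the `StringIndexError` shown earlier in the thread.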

tomerarnon · June 25, 2021, 9:29am

The same problem would occur if searching for a longer string.

julia> findfirst("α,β", "α,β,γ,δ")
1:4

julia> "α,β,γ,δ"[1:4]
"α,β"

julia> sizeof("α,β")
5

It’s simply the convention that characters are indexed by their starting index, rather than by the entire index range that they occupy. Overall, this leads to fewer inconveniences than the alternative (for consistency, either the behavior you showed or this one would have to change).
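
To make the convention concrete, here is a sketch using the same strings as above: the range returned by `findfirst` ends at the starting index of the last matched character, and the byte-level extent can be recovered with `nextind` if it is ever needed.

```julia
s = "α,β,γ,δ"
r = findfirst("α,β", s)          # 1:4

# last(r) is the start index of the final matched character 'β',
# so the range can be used directly for slicing.
s[r]                             # "α,β"

# Byte-level extent of the match.
bytes = first(r):(nextind(s, last(r)) - 1)   # 1:5
length(bytes) == sizeof("α,β")               # true
```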
