Unexpected index of Unicode subscript `char` in `string`?

It seems that in some cases, indexing a string that contains a subscript character gives an unexpected result:

julia> "a₁"[1]
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> "a₁"[2]
'₁': Unicode U+2081 (category No: Number, other)

julia> "α₁"[1]
'α': Unicode U+03B1 (category Ll: Letter, lowercase)

julia> "α₁"[2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'α', [3]=>'₁'

julia> length("α₁")
2

I’m using Julia 1.6.1 by the way.

String indices refer to code units (bytes), not characters. The strings section of the manual covers this extensively:

https://docs.julialang.org/en/v1/manual/strings/
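
To make this concrete, here is a small sketch using only Base functions. It shows which byte indices of "α₁" are valid and how to step from one character to the next. The variable name `s` is just for illustration.

```julia
s = "α₁"

ncodeunits(s)          # 5: 'α' takes 2 bytes, '₁' takes 3
length(s)              # 2 characters
collect(eachindex(s))  # [1, 3]: the valid (character-start) indices
isvalid(s, 2)          # false: byte 2 is in the middle of 'α'
thisind(s, 2)          # 1: start of the character that contains byte 2
nextind(s, 1)          # 3: start of the character after the one at index 1
s[nextind(s, 1)]       # '₁'
```

So the safe way to move through a string is with `eachindex`, `nextind`, and `prevind` rather than by adding 1 to an index.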

There are many things you might want to index.

  • codeunit: basically the individual bytes
  • codepoint: a single character, which may take multiple bytes in UTF-8
  • grapheme: a single display unit. For example :family_woman_woman_girl_boy: is 1 grapheme, but it is made up of 4 emoji codepoints (:woman::woman::girl::boy:, joined by zero-width joiners) that display as one when placed next to each other

Of these, only codeunit indexing can be done in O(1) time.
The rest are O(n).

I don’t actually know how to index by codepoint, or grapheme.
For codepoint you can call collect first, which is OK, but not great.
For grapheme, I am not sure if it can be known without knowing things about the device displaying it.
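
For the codepoint case, here is a minimal sketch of two ways to get the n-th character, assuming O(n) cost is acceptable. The names `s`, `chars`, and `i` are only for illustration.

```julia
s = "α₁x"

# Option 1: materialize the characters (O(n) time and memory).
chars = collect(s)
chars[2]               # '₁'

# Option 2: walk to the n-th character without allocating (still O(n)).
# nextind(s, 0, n) returns the byte index of the n-th character.
i = nextind(s, 0, 2)   # 3
s[i]                   # '₁'
```

The second form avoids the intermediate array, which may matter if the string is long and only one character is needed.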

Now I understand that "α" has 2 code units, whereas 'a' has only 1:

julia> codeunits("α")
2-element Base.CodeUnits{UInt8, String}:
 0xce
 0xb1

julia> codeunits("a")
1-element Base.CodeUnits{UInt8, String}:
 0x61

which causes this behavior. Thank you!

Somewhat related to this topic: is there a particular reason why findfirst does not return all the indices a char occupies when it has multiple code units?

julia> findfirst("α", "α1")
1:1

The reason I ask this is that I want to efficiently locate a char inside a string no matter how many code units it contains. I know I can definitely do something like this though:

julia> findfirst("α", "α1")[1] : ncodeunits("α")
1:2

Thank you!

Because then that range can’t be used to get the character:

julia> "α1"[1:2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'α', [3]=>'1'
...

julia> "α1"[1:1]
"α"

There’s Unicode.graphemes to iterate over them, and I guess one could collect that to index them.
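
A minimal sketch of that idea, using the `Unicode` standard library. The example string (a letter followed by a combining accent) is chosen here for illustration and is not from the thread.

```julia
using Unicode

s = "e\u0301x"   # 'e' + combining acute accent, then 'x'

length(s)                   # 3 code points
gs = collect(graphemes(s))  # the graphemes, each a substring of s
length(gs)                  # 2
gs[1] == "e\u0301"          # true: base letter and accent stay together
gs[2] == "x"                # true
```

Collecting costs O(n), so this is best done once if several graphemes need to be looked up by position.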

If you want to get the location of the char, why not just search for it as a char, which gives you the starting index? Isn’t this the more appropriate usage?

julia> str = "α1"
"α1"

julia> i = findfirst('α', str)
1

julia> str[i]
'α': Unicode U+03B1 (category Ll: Letter, lowercase)

The same problem would occur if searching for a longer string.

julia> findfirst("α,β", "α,β,γ,δ")
1:4

julia> "α,β,γ,δ"[1:4]
"α,β"

julia> sizeof("α,β")
5

It's simply the convention that characters are indexed by their starting index, rather than by the entire index range they occupy. Overall, this leads to fewer inconveniences than the alternative: for consistency, either the behavior you showed would have to change, or this one would.
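
If the full byte extent of a match is really needed, it can be derived from the returned range. The following is only a sketch under that assumption; `byte_range` is a hypothetical name, not something Base provides.

```julia
s = "α,β,γ,δ"
r = findfirst("α,β", s)   # 1:4, where last(r) is the index at which 'β' starts

s[r]                      # "α,β": the range works directly for indexing

# Extend the range to the last byte of the last matched character.
byte_range = first(r):(nextind(s, last(r)) - 1)   # 1:5
length(byte_range) == sizeof("α,β")               # true
String(codeunits(s)[byte_range]) == "α,β"         # true
```

Note that `byte_range` is meant for use with `codeunits(s)`, not with `s` itself, since its endpoint is generally not a valid string index.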