It seems that in some cases indexing into a string that contains a subscript char gives the wrong result:
julia> "a₁"[1]
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
julia> "a₁"[2]
'₁': Unicode U+2081 (category No: Number, other)
julia> "α₁"[1]
'α': Unicode U+03B1 (category Ll: Letter, lowercase)
julia> "α₁"[2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'α', [3]=>'₁'
julia> length("α₁")
2
I’m using Julia 1.6.1 by the way.
String indices refer to code units (bytes), not characters. This is covered extensively in the strings section of the manual:
https://docs.julialang.org/en/v1/manual/strings/
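To make the byte-based indexing concrete, here is a minimal sketch using only functions from Base (`eachindex`, `isvalid`, `nextind`, `collect`). The variable names are just for illustration.

```julia
s = "α₁"

# Valid indices are the byte offsets at which each character starts.
valid = collect(eachindex(s))      # [1, 3]

# `isvalid` reports whether a byte index is the start of a character.
isvalid(s, 1)                      # true
isvalid(s, 2)                      # false: second byte of 'α'

# `nextind` steps from one valid index to the next.
i = nextind(s, 1)                  # 3
s[i]                               # '₁'

# To get the n-th character rather than the n-th byte, collect the chars first.
chars = collect(s)                 # ['α', '₁']
chars[2]                           # '₁'
```

Collecting allocates a `Vector{Char}`, so it is mostly worthwhile when you need random access by character position more than once.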
Now I understand: “α” takes 2 code units, whereas ‘a’ takes only 1:
julia> codeunits("α")
2-element Base.CodeUnits{UInt8, String}:
0xce
0xb1
julia> codeunits("a")
1-element Base.CodeUnits{UInt8, String}:
0x61
which causes this behavior. Thank you!
Somewhat related to this topic: is there a particular reason that findfirst does not return all the indices a char occupies when it has multiple code units?
julia> findfirst("α", "α1")
1:1
The reason I ask this is that I want to efficiently locate a char inside a string no matter how many code units it contains. I know I can definitely do something like this though:
julia> findfirst("α", "α1")[1] : ncodeunits("α")
1:2
Thank you!
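Note that the expression above gives the right span only because the match starts at index 1. A more general sketch extends the range returned by `findfirst` up to the byte just before the next character. The helper name `codeunit_range` is made up for this example; `nextind` and `codeunits` are from Base.

```julia
# Hypothetical helper: full byte span covered by the first match of
# `needle` in `haystack`, or `nothing` if there is no match.
function codeunit_range(needle::AbstractString, haystack::AbstractString)
    r = findfirst(needle, haystack)
    r === nothing && return nothing
    # last(r) is where the last matched character starts;
    # nextind gives the start of the character after it.
    return first(r):(nextind(haystack, last(r)) - 1)
end

s = "xα1"
r = findfirst("α", s)              # 2:2, usable directly as s[r]
b = codeunit_range("α", s)         # 2:3, the bytes that 'α' occupies
s[r]                               # "α"
codeunits(s)[b]                    # the raw bytes 0xce, 0xb1
```

The byte span `b` is meant for working with `codeunits(s)`; it is not necessarily a valid range for indexing the string itself.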
Because then that range can’t be used to get the character:
julia> "α1"[1:2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'α', [3]=>'1'
...
julia> "α1"[1:1]
"α"
There’s Unicode.graphemes to iterate over the graphemes of a string, and I guess one could collect the result to index into them.
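A minimal sketch of that idea, assuming the `Unicode` standard library is loaded. Keep in mind that a grapheme can span several code points, so this is not the same as collecting `Char`s.

```julia
using Unicode

s = "α₁e\u0301"                     # 'e' followed by a combining acute accent

length(s)                           # 4 code points
gs = collect(Unicode.graphemes(s))  # 3 graphemes, each a SubString
gs[2]                               # "₁"
gs[3]                               # "é": two code points, one grapheme
```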
If you want to get the location of the char, why not just search for it as a char, which gives you the starting index? Isn’t this the more appropriate usage?
julia> str = "α1"
"α1"
julia> i = findfirst('α', str)
1
julia> str[i]
'α': Unicode U+03B1 (category Ll: Letter, lowercase)
The same problem would occur if searching for a longer string.
julia> findfirst("α,β", "α,β,γ,δ")
1:4
julia> "α,β,γ,δ"[1:4]
"α,β"
julia> sizeof("α,β")
5
It’s simply the convention that a character is indexed by its starting index, rather than by the entire range of indices it occupies. Overall, this leads to fewer inconveniences than the alternative (for consistency, either the behavior you showed or this one would have to change).
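The convention can be seen in how the pieces compose. In this small sketch (Base functions only, variable names illustrative), both endpoints of the range returned by `findfirst` are valid indices, so the range works for slicing and for continuing past the match:

```julia
s = "α,β,γ,δ"
r = findfirst("α,β", s)             # 1:4

# Both ends of the range are starts of characters.
isvalid(s, first(r))                # true
isvalid(s, last(r))                 # true

matched = s[r]                      # "α,β"

# The last valid index of the match differs from its size in bytes.
lastindex(matched)                  # 4
ncodeunits(matched)                 # 5

# Continue after the match by stepping to the next valid index.
rest = s[nextind(s, last(r)):end]   # ",γ,δ"
```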