Indexing strings by Unicode code point instead of code unit?

stevengj · February 14, 2021, 4:01pm

This is not correct. Julia string indices retrieve characters, not code units; it’s just that the valid indices are not consecutive:

julia> s = "αβγ"
"αβγ"

julia> collect(eachindex(s))
3-element Array{Int64,1}:
 1
 3
 5

julia> s[1]
'α': Unicode U+03B1 (category Ll: Letter, lowercase)

julia> s[3]
'β': Unicode U+03B2 (category Ll: Letter, lowercase)

julia> s[5]
'γ': Unicode U+03B3 (category Ll: Letter, lowercase)
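To make the "not consecutive" part concrete, here is a small sketch of what happens at the in-between indices, using only Base functions (`isvalid`, `nextind`, `prevind`, `thisind`). The string `s` is the same one as above; the expected values in the comments assume the default UTF-8 `String` type.

```julia
s = "αβγ"

# Index 2 points into the middle of the two-byte encoding of 'α',
# so it is not a valid character index.
isvalid(s, 1)    # true
isvalid(s, 2)    # false

# Indexing there throws instead of returning a partial character.
err = try
    s[2]
catch e
    e
end
err isa StringIndexError    # true

# nextind / prevind step between valid indices; thisind rounds down
# to the start of the character that contains a given code unit.
nextind(s, 1)    # 3
prevind(s, 5)    # 3
thisind(s, 2)    # 1
lastindex(s)     # 5
```

So in practice you rarely do arithmetic like `i + 1` on string indices; you step with `nextind`/`prevind` or just iterate.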

(Note that the return value of s[i] is a character represented by Char, a 4-byte object corresponding to a Unicode code point.) String iteration is also over characters:

julia> for c in s
           display(c) # pretty-print c
       end
'α': Unicode U+03B1 (category Ll: Letter, lowercase)
'β': Unicode U+03B2 (category Ll: Letter, lowercase)
'γ': Unicode U+03B3 (category Ll: Letter, lowercase)

julia> for (i,c) in pairs(s)
           @show i, c, s[i]
       end
(i, c, s[i]) = (1, 'α', 'α')
(i, c, s[i]) = (3, 'β', 'β')
(i, c, s[i]) = (5, 'γ', 'γ')

In contrast, the code units (bytes for UTF-8) are retrieved by the codeunit and codeunits functions:

julia> codeunit(s, 3)
0xce

julia> codeunits(s)
6-element Base.CodeUnits{UInt8,String}:
 0xce
 0xb1
 0xce
 0xb2
 0xce
 0xb3
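As a quick sketch of how the two views relate for the same `s`: `length` counts characters, while `ncodeunits` counts code units. This assumes the standard UTF-8 `String`; other `AbstractString` types may use a different code unit type.

```julia
s = "αβγ"

length(s)        # 3  (characters)
ncodeunits(s)    # 6  (code units, i.e. bytes for UTF-8)
sizeof(s)        # 6  (bytes)
codeunit(s)      # UInt8, the code unit type of this string

# The valid character indices are exactly the code-unit positions
# where a character starts.
collect(eachindex(s)) == [i for i in 1:ncodeunits(s) if isvalid(s, i)]    # true
```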

Substrings work similarly: you give them code-unit indices, but they still give you the whole string of Unicode characters:

julia> s[3:5]   # a copy
"βγ"

julia> SubString(s, 3:5)    # a view
"βγ"

So, the point is, once you convert your character offsets to Julia String indices, you are still working with characters.
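For the conversion step itself, one possible sketch uses `nextind(s, 0, n)`, which returns the String index of the n-th character. The helper name `charindex` is made up for this example and is not a Base function. Note that the conversion scans the string, so it costs O(n) per call.

```julia
s = "αβγ"

# Hypothetical helper (not part of Base): String index of the n-th character,
# where n is a 1-based character offset.
charindex(s::AbstractString, n::Integer) = nextind(s, 0, n)

charindex(s, 2)       # 3
s[charindex(s, 2)]    # 'β'

# Characters 2 through 3, taken as a view using converted indices.
i, j = charindex(s, 2), charindex(s, 3)
SubString(s, i, j)    # "βγ"

# Going the other way: number of characters from index 1 through index 5.
length(s, 1, 5)       # 3
```

If you need many random accesses by character offset, it may be simpler to `collect(s)` into a `Vector{Char}` once and index that, at the cost of 4 bytes per character.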
