Indexing strings by Unicode code point instead of code unit?

This is not correct. Julia string indices retrieve characters, not code units; it's just that the indices are not consecutive:

julia> s = "αβγ"
"αβγ"

julia> collect(eachindex(s))
3-element Array{Int64,1}:
 1
 3
 5

julia> s[1]
'α': Unicode U+03B1 (category Ll: Letter, lowercase)

julia> s[3]
'β': Unicode U+03B2 (category Ll: Letter, lowercase)

julia> s[5]
'γ': Unicode U+03B3 (category Ll: Letter, lowercase)

(Note that the return value of s[i] is a character represented by Char, a 4-byte object representing a Unicode code point.) String iteration is also over characters:

julia> for c in s
           display(c) # pretty-print c
       end
'α': Unicode U+03B1 (category Ll: Letter, lowercase)
'β': Unicode U+03B2 (category Ll: Letter, lowercase)
'γ': Unicode U+03B3 (category Ll: Letter, lowercase)

julia> for (i,c) in pairs(s)
           @show i, c, s[i]
       end
(i, c, s[i]) = (1, 'α', 'α')
(i, c, s[i]) = (3, 'β', 'β')
(i, c, s[i]) = (5, 'γ', 'γ')

In contrast, the code units (bytes for UTF-8) are retrieved by the codeunit function:

julia> codeunit(s, 3)
0xce

julia> codeunits(s)
6-element Base.CodeUnits{UInt8,String}:
 0xce
 0xb1
 0xce
 0xb2
 0xce
 0xb3
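To make the difference concrete, here is a small sketch of what happens at a byte offset that falls in the middle of a character. It uses only functions I believe are in Base (`isvalid`, `thisind`, `ncodeunits`, `codeunit`); the variable names are just for illustration.

```julia
s = "αβγ"

length(s)        # 3 characters
ncodeunits(s)    # 6 code units (bytes in UTF-8)

isvalid(s, 2)    # false: byte 2 is the second byte of 'α'
codeunit(s, 2)   # 0xb1, code units can be read at any byte offset
thisind(s, 2)    # 1, the index where the character containing byte 2 starts

# Character indexing at a continuation byte throws rather than returning garbage
err = try
    s[2]
catch e
    e
end
err isa StringIndexError   # true
```

So an index like 2 is a perfectly good code-unit index, but it is not a valid character index.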

Substrings work similarly: the range endpoints are code-unit indices, each pointing at the start of a character, but the result still consists of whole Unicode characters:

julia> s[3:5]   # a copy
"βγ"

julia> SubString(s, 3:5)    # a view
"βγ"

So, the point is, once you convert your character offsets to Julia String indices, you are still working with characters.
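For the conversion itself, one possible sketch is below. The helper `charindex` is a hypothetical name of mine, not something in Base; it just wraps `nextind(str, 0, n)`, which steps forward `n` characters from before the start of the string. Note that this is a linear scan, so it costs O(n) per lookup.

```julia
s = "αβγ"

# Hypothetical helper (not part of Base): String index of the n-th character, 1-based
charindex(str::AbstractString, n::Integer) = nextind(str, 0, n)

i = charindex(s, 2)   # 3
s[i]                  # 'β'

# Characters 2 through 3, expressed as a range of String indices
r = charindex(s, 2):charindex(s, 3)   # 3:5
s[r]                                  # "βγ" (a copy)
SubString(s, r)                       # same text, as a view

# Going the other way: number of characters at indices 1 through 5
length(s, 1, 5)       # 3
```

If you need many random lookups by character offset, it may be simpler to `collect(s)` into a `Vector{Char}` once and index that instead.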
