This is not correct. Julia string indices retrieve characters, not code units; it’s just that the indices are not consecutive:
julia> s = "αβγ"
"αβγ"
julia> collect(eachindex(s))
3-element Array{Int64,1}:
1
3
5
julia> s[1]
'α': Unicode U+03B1 (category Ll: Letter, lowercase)
julia> s[3]
'β': Unicode U+03B2 (category Ll: Letter, lowercase)
julia> s[5]
'γ': Unicode U+03B3 (category Ll: Letter, lowercase)
(Note that the return value of s[i] is a character represented by Char, a 4-byte object corresponding to a Unicode codepoint.) String iteration is also over characters:
julia> for c in s
           display(c) # pretty-print c
       end
'α': Unicode U+03B1 (category Ll: Letter, lowercase)
'β': Unicode U+03B2 (category Ll: Letter, lowercase)
'γ': Unicode U+03B3 (category Ll: Letter, lowercase)
julia> for (i,c) in pairs(s)
           @show i, c, s[i]
       end
(i, c, s[i]) = (1, 'α', 'α')
(i, c, s[i]) = (3, 'β', 'β')
(i, c, s[i]) = (5, 'γ', 'γ')
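To make the "not consecutive" part concrete, here is a small sketch using only functions from Base (isvalid, thisind, nextind, prevind) on the same string s. It shows what happens at the indices that fall between characters, and how to step from one character index to the next without guessing byte widths:

```julia
s = "αβγ"

# Index 2 lies inside the 2-byte encoding of 'α', so it is not a valid index.
@show isvalid(s, 1)    # true
@show isvalid(s, 2)    # false

# Stepping between valid character indices:
@show thisind(s, 2)    # 1, the start of the character that contains byte 2
@show nextind(s, 1)    # 3, the index of the character after 'α'
@show prevind(s, 5)    # 3, the index of the character before 'γ'

# Indexing at an invalid position throws rather than returning a byte.
err = try
    s[2]
catch e
    e
end
@show err isa StringIndexError   # true
```

So an index either points at the start of a character or is rejected; it never silently returns a code unit.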
In contrast, the code units (bytes for UTF-8) are retrieved by the codeunit and codeunits functions:
julia> codeunit(s, 3)
0xce
julia> codeunits(s)
6-element Base.CodeUnits{UInt8,String}:
0xce
0xb1
0xce
0xb2
0xce
0xb3
Substrings work similarly: you give them code-unit indices, but what you get back is still a string of whole Unicode characters:
julia> s[3:5] # a copy
"βγ"
julia> SubString(s, 3:5) # a view
"βγ"
So, the point is, once you convert your character offsets to Julia String indices, you are still working with characters.
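For the conversion step itself, one possible sketch is below. The helper name charindex is hypothetical (it is not part of Base); it relies on the three-argument form nextind(s, i, n), which advances n characters from index i. I am assuming your character offsets are 1-based; note that this scans the string, so it is linear in n rather than constant time.

```julia
# Hypothetical helper (not in Base): String index of the n-th character of s,
# assuming n is a 1-based character offset with 1 <= n <= length(s).
charindex(s::AbstractString, n::Integer) = nextind(s, 0, n)

s = "αβγ"

i = charindex(s, 2)    # 3, the index of 'β'
j = charindex(s, 3)    # 5, the index of 'γ'

@show s[i]             # 'β'
@show s[i:j]           # "βγ", characters 2 through 3
@show SubString(s, i, j)   # the same range as a view instead of a copy
```

If you need many offsets from the same string, collecting eachindex(s) once and looking offsets up in that array avoids rescanning.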