How to iterate over Unicode characters with multiple code points

Hi, this code

str = "Héllo World"
for i in str 
    println(i)
end

produces this output:

H
e
́
l
l
o

W
o
r
l
d

How can I make it produce this instead?

H
é
l
l
o

W
o
r
l
d 

(On a side note, if you run it in the VS Code REPL, the line containing only the combining accent in the first output looks empty.)


It looks like they are truly two characters:

julia> ary = [str[2:3]...]
2-element Array{Char,1}:
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 '́': Unicode U+0301 (category Mn: Mark, nonspacing)

julia> join(ary)
"é"

julia> ary[1] = 'a'; join(ary)
"á"

Unicode indexing is best done via eachindex, which returns the valid byte indices, i.e. the indices at which each Unicode code point starts.

That said, @jling is correct: U+0301 is a valid Unicode “character” in its own right, even though it is a combining mark.

The manual has a lot of very useful information about this.
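
To illustrate the eachindex point, here is a minimal sketch. It assumes the é is written as e followed by the combining accent U+0301 (spelled with an escape so the encoding is unambiguous); the variable name s is only for illustration.

```julia
# Assumption: é is stored as 'e' (U+0065) plus the combining acute accent (U+0301).
s = "He\u0301llo"

# eachindex yields only the byte indices where a code point starts.
# U+0301 occupies two bytes in UTF-8, so index 4 is skipped.
for i in eachindex(s)
    println(i, " => ", repr(s[i]))
end

# Number of code points vs. number of bytes (code units).
println(length(s), " code points, ", ncodeunits(s), " bytes")
```

Note that this still visits the accent as its own character; eachindex makes indexing safe, but it does not group combining marks with their base letter.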


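Note that this works as expected when the é is a single precomposed code point (str1 below), but not when it is an e followed by a combining accent (str):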

julia> str = "Héllo World"
"Héllo World"

julia> str1 = "Héllo World"
"Héllo World"

julia> split(str, "")
12-element Array{SubString{String},1}:
 "H"
 "e"
 "́"
 "l"
 "l"
 "o"
 " "
 "W"
 "o"
 "r"
 "l"
 "d"

julia> split(str1, "")
11-element Array{SubString{String},1}:
 "H"
 "é"
 "l"
 "l"
 "o"
 " "
 "W"
 "o"
 "r"
 "l"
 "d"

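It depends on whether the string uses U+0065 followed by U+0301 (two code points) or U+00E9 (one precomposed code point).
However, for characters that can only be written as combinations, with no precomposed form, the question remains open.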
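
If the input happens to use the two-code-point form, one possible workaround is to normalize it to the composed form (NFC) first. This is only a sketch, assuming the Unicode standard library's normalize function; the variable names are illustrative, and it does not help for combinations that have no precomposed code point.

```julia
using Unicode

# Assumption: the input stores é as 'e' + U+0301 (decomposed form).
decomposed = "He\u0301llo World"

# NFC composes base letter + combining mark into one code point where one exists.
composed = Unicode.normalize(decomposed, :NFC)

println(length(decomposed))  # code points before normalization
println(length(composed))    # code points after normalization

# After normalization, plain character iteration sees é as a single Char.
for c in composed
    println(c)
end
```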


You can use Unicode.graphemes to iterate over graphemes (“user-perceived characters” in Unicode), regardless of how they are encoded as code points:

julia> using Unicode

julia> graphemes("Héllo World")
length-11 GraphemeIterator{String} for "Héllo World"

julia> graphemes("Héllo World") |> collect
11-element Array{SubString{String},1}:
 "H"
 "é"
 "l"
 "l"
 "o"
 " "
 "W"
 "o"
 "r"
 "l"
 "d"

Note that the second element of this array is a string of 2 code points (“2 characters” in the terminology of the Julia docs).
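
Applied to the loop from the original question, that could look roughly like the sketch below. It again assumes the é is encoded as e plus U+0301; the names s, g, and counts are just for illustration.

```julia
using Unicode

# Assumption: é is stored as 'e' (U+0065) plus the combining acute accent (U+0301).
s = "He\u0301llo"

# Each g is a SubString holding one user-perceived character.
for g in graphemes(s)
    println(g, "  (", length(g), " code point(s))")
end

# Code points per grapheme; the second grapheme holds two.
counts = [length(g) for g in graphemes(s)]
println(counts)
```

Because each element is a SubString rather than a Char, code that needs a single Char still has to decide how to handle multi-code-point graphemes.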
