How to iterate over unicode characters with multiple codepoints

systems · October 6, 2020, 3:13am

Hi, this code

str = "Héllo World"
for i in str
    println(i)
end

produces this output:

H
e
́
l
l
o

W
o
r
l
d

How can I make it produce the following instead?

H
é
l
l
o

W
o
r
l
d

(As a side note, if you run it in the VS Code REPL, the line that contains only the combining accent in the first output looks empty.)

jling · October 6, 2020, 4:34am

It looks like they really are two separate characters:

julia> ary = [str[2:3]...]
2-element Array{Char,1}:
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 '́': Unicode U+0301 (category Mn: Mark, nonspacing)

julia> join(ary)
"é"

julia> ary[1] = 'a'; join(ary)
"á"
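To make the two code points visible without relying on how a terminal renders them, here is a minimal sketch. It assumes the original string is in decomposed form and rebuilds it with an explicit `\u0301` escape; the variable name `str` just mirrors the question.

```julia
# Build the string from explicit escapes so the decomposed form is unambiguous.
str = "He\u0301llo World"

# Iterating a String yields one Char per code point, not per visible character.
for c in str
    println(repr(c), " => U+", uppercase(string(UInt32(c), base = 16, pad = 4)))
end

length(str)   # 12 code points, although only 11 characters are displayed
```

The third `Char` printed is U+0301, the combining acute accent, which is why it ends up on a line of its own in the original loop.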

Sukera · October 6, 2020, 5:54am

Unicode indexing is best done via eachindex, which returns the valid byte indices corresponding to Unicode code points.

That said, @jling is correct: the accent here (U+0301) is a valid Unicode “character” in its own right, though it is a combining mark.

The manual has a lot of very useful information about this.
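As a rough illustration of what eachindex gives you, the sketch below assumes the decomposed string from the question, written with an explicit escape. Note that it still visits the accent as a separate code point, so on its own it does not group 'e' and U+0301 together.

```julia
str = "He\u0301llo World"   # decomposed: 'e' followed by U+0301

# eachindex yields only the valid byte indices, one per code point.
for i in eachindex(str)
    println(i, " => ", repr(str[i]))
end

# U+0301 occupies two bytes in UTF-8 (byte indices 3 and 4),
# so 4 is skipped and is not a valid index into the string.
indices = collect(eachindex(str))
isvalid(str, 4)   # false
```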

Nosferican · October 6, 2020, 1:08pm

Note that splitting works as expected when é is stored as a single code point (str1 below):

julia> str = "Héllo World"
"Héllo World"

julia> str1 = "Héllo World"
"Héllo World"

julia> split(str, "")
12-element Array{SubString{String},1}:
 "H"
 "e"
 "́"
 "l"
 "l"
 "o"
 " "
 "W"
 "o"
 "r"
 "l"
 "d"

julia> split(str1, "")
11-element Array{SubString{String},1}:
 "H"
 "é"
 "l"
 "l"
 "o"
 " "
 "W"
 "o"
 "r"
 "l"
 "d"

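It depends on whether the string encodes é as U+0065 followed by U+0301 (decomposed) or as the single code point U+00E9 (precomposed).
However, for characters that can only be written as a combination of code points, the question remains open.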
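If the input may arrive in either form, one option is to normalize it first. The sketch below uses Unicode.normalize from the standard library with the :NFC form, which composes sequences where a precomposed code point exists; the variable names are only illustrative.

```julia
using Unicode

decomposed = "He\u0301llo World"                # U+0065 followed by U+0301
composed = Unicode.normalize(decomposed, :NFC)  # canonical composition

length(decomposed)                # 12 code points
length(composed)                  # 11 code points
composed == "H\u00e9llo World"    # true: é is now the single code point U+00E9

# Normalization cannot help when no precomposed code point exists,
# for example an 'x' carrying a combining acute accent.
length(Unicode.normalize("x\u0301", :NFC))   # still 2 code points
```

This only narrows the problem: sequences without a precomposed form still iterate as several code points.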

yha · October 6, 2020, 1:28pm

You can use Unicode.graphemes to iterate over graphemes (“user-perceived characters” in Unicode), regardless of how they are encoded in code points:

julia> using Unicode

julia> graphemes("Héllo World")
length-11 GraphemeIterator{String} for "Héllo World"

julia> graphemes("Héllo World") |> collect
11-element Array{SubString{String},1}:
 "H"
 "é"
 "l"
 "l"
 "o"
 " "
 "W"
 "o"
 "r"
 "l"
 "d"

Note that the second element of this array is a string of 2 code points (“2 characters” in the terminology of the Julia docs).
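Applied to the loop from the original question, this could look like the following sketch. It assumes the decomposed string, written here with an explicit escape so the two code points are unambiguous.

```julia
using Unicode

str = "He\u0301llo World"   # 'e' followed by the combining accent U+0301

# Each g is a SubString covering one grapheme, so é is printed on a single line.
for g in graphemes(str)
    println(g)
end

n = length(graphemes(str))            # 11 graphemes
second = collect(graphemes(str))[2]   # the grapheme "é"
length(second)                        # 2 code points
```

Because each element is a SubString rather than a Char, code that expects a Char (for example isletter) would need adapting.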
