Hi, this code
str = "Héllo World"
for i in str
    println(i)
end
produces this output:
H
e
́
l
l
o
W
o
r
l
d
How can I make it produce this instead?
H
é
l
l
o
W
o
r
l
d
(On a side note, if you run it in the VS Code REPL, the line containing only the combining accent ́ in the first output looks empty.)
jling
Looks like they are truly two characters:
julia> ary = [str[2:3]...]
2-element Array{Char,1}:
'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
'́': Unicode U+0301 (category Mn: Mark, nonspacing)
julia> join(ary)
"é"
julia> ary[1] = 'a'; join(ary)
"á"
Sukera
Unicode indexing is best done via eachindex, which returns the valid byte indices corresponding to Unicode code points.
That said, @jling is correct: the combining acute accent (U+0301) is a valid Unicode “character” in its own right, even though it is a combining mark.
The manual has a lot of very useful information about this.
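As a rough illustration of the eachindex point, here is a minimal sketch. The string literal is my own assumption: it is written with an explicit \u0301 escape so that the decomposed encoding is unambiguous. Note that the combining mark still gets its own index, so this does not merge it with the preceding letter.

```julia
# Decomposed form: 'e' followed by the combining acute accent U+0301.
s = "He\u0301llo"

# eachindex yields only the byte indices at which a code point starts.
# U+0301 takes two bytes in UTF-8, so byte index 4 is skipped.
for i in eachindex(s)
    println(i, " => ", repr(s[i]))
end

idx = collect(eachindex(s))   # [1, 2, 3, 5, 6, 7]
```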
Note that this works as expected
julia> str = "Héllo World"
"Héllo World"
julia> str1 = "Héllo World"
"Héllo World"
julia> split(str, "")
12-element Array{SubString{String},1}:
"H"
"e"
"́"
"l"
"l"
"o"
" "
"W"
"o"
"r"
"l"
"d"
julia> split(str1, "")
11-element Array{SubString{String},1}:
"H"
"é"
"l"
"l"
"o"
" "
"W"
"o"
"r"
"l"
"d"
It depends on whether the é is encoded as U+0065 followed by U+0301 (two code points) or as the single precomposed code point U+00E9.
However, for combinations that have no precomposed form, the question remains open.
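When a precomposed form exists, one possible workaround is to normalize the string to NFC before iterating. A minimal sketch, assuming the two literals below (again written with explicit escapes, and with variable names of my own choosing); it uses Unicode.normalize from the standard library and does not help for combinations that lack a precomposed code point.

```julia
using Unicode

decomposed  = "He\u0301llo World"   # 'e' followed by U+0301
precomposed = "H\u00e9llo World"    # single code point U+00E9

# length counts code points, so the two encodings differ by one.
println(length(decomposed))    # 12
println(length(precomposed))   # 11

# NFC composes 'e' + U+0301 into U+00E9 where a precomposed form exists.
composed = Unicode.normalize(decomposed, :NFC)
println(composed == precomposed)   # true

for c in composed
    println(c)
end
```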
yha
You can use Unicode.graphemes to iterate over graphemes (“user-perceived characters” in Unicode), regardless of how they are encoded in code points:
julia> using Unicode
julia> graphemes("Héllo World")
length-11 GraphemeIterator{String} for "Héllo World"
julia> graphemes("Héllo World") |> collect
11-element Array{SubString{String},1}:
"H"
"é"
"l"
"l"
"o"
" "
"W"
"o"
"r"
"l"
"d"
Note that the second element of this array is a string of 2 code points (“2 characters” in the terminology of the Julia docs).
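Applied to the loop from the original question, that could look roughly like this. It is only a sketch: the literal uses an explicit \u0301 escape (my assumption) to guarantee the decomposed encoding, and gs is a name I introduced for checking the result.

```julia
using Unicode

str = "He\u0301llo World"   # 'e' followed by the combining accent U+0301

# Iterate over graphemes instead of code points; each g is a SubString.
for g in graphemes(str)
    println(g)
end

gs = collect(graphemes(str))
println(length(gs))      # 11 graphemes
println(length(gs[2]))   # 2 code points in the second grapheme
```

Since each element is a SubString rather than a Char, code that needs a Char has to handle multi-code-point graphemes explicitly.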