Why does Julia have this problem with accented characters?
How do I deal with it?
julia> length("cão")
3
julia> "cão"[3]
ERROR: StringIndexError("cão", 3)
Stacktrace:
[1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
[2] getindex_continued(::String, ::Int64, ::UInt32) at ./strings/string.jl:220
[3] getindex(::String, ::Int64) at ./strings/string.jl:213
[4] top-level scope at REPL[2]:1
Even weirder:
julia> "cão"[4]
'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
Thanks.
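For context, the behavior above follows from how Julia indexes a `String`: indices are byte offsets into the UTF-8 encoding, not character counts, and `ã` occupies two bytes. A minimal sketch, using only Base functions, that inspects this for the string in question:

```julia
s = "cão"

# UTF-8 bytes: 'c' is 1 byte, 'ã' is 2 bytes (0xc3 0xa3), 'o' is 1 byte
bytes = collect(codeunits(s))      # UInt8[0x63, 0xc3, 0xa3, 0x6f]

n_chars = length(s)                # 3 characters
n_bytes = ncodeunits(s)            # 4 bytes

# Index 3 falls in the middle of 'ã', so it is not a valid character index
valid = [isvalid(s, i) for i in 1:n_bytes]   # [true, true, false, true]

# thisind maps a byte index back to the start of the character containing it
start_of_char = thisind(s, 3)      # 2
```

This is why `s[3]` throws a `StringIndexError` while `s[4]` returns `'o'`.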
Hi @fredrikekre ,
So, if I really want the i-th character of a UTF-8 string, like this:
julia> s = "cão"
"cão"
julia> val(s, 2)
'ã': Unicode U+00E3 (category Ll: Letter, lowercase)
julia> sub(s, 1, 2)
"cã"
julia> sub(s, 1:2)
"cã"
julia> sub(s, 2, 3)
"ão"
julia> sub(s, 2:3)
"ão"
I need something like this:
# Byte index reached by advancing i - 1 characters from the valid index k
function ind(s::String, i::Int, k::Int = 1)
    for _ = 1:i - 1
        k = nextind(s, k)
    end
    return k
end
# Byte range spanning the i-th through f-th characters
function interval(s::String, i::Int, f::Int)
    start = ind(s, i)
    stop = ind(s, f - i + 1, start)
    return start:stop
end
sub(s::String, i::Int, f::Int) = s[interval(s, i, f)]
sub(s::String, i::UnitRange) = s[interval(s, i.start, i.stop)]
val(s::String, i::Int) = s[ind(s, i)]
Or is there another way to do it?
Thanks.
Assuming your mental model is that strings are made of graphemes, you can do:
julia> import Unicode: graphemes
julia> s = "cão"
"cão"
julia> collect(graphemes(s))[1]
"c"
julia> collect(graphemes(s))[2]
"ã"
julia> collect(graphemes(s))[3]
"o"
The i-th codepoint is given by s[nextind(s, 0, i)]:
julia> s = "cão"
"cão"
julia> [s[nextind(s, 0, i)] for i = 1:3]
3-element Array{Char,1}:
'c'
'ã'
'o'
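The same `nextind(s, 0, i)` call also covers the substring case asked about earlier in the thread. A minimal sketch; `charsub` is a hypothetical helper name, not a Base function:

```julia
# Hypothetical helper: substring from the i-th to the j-th codepoint (inclusive)
charsub(s::AbstractString, i::Integer, j::Integer) =
    SubString(s, nextind(s, 0, i), nextind(s, 0, j))

s = "cão"
a = charsub(s, 1, 2)   # "cã"
b = charsub(s, 2, 3)   # "ão"

# For prefixes and suffixes, Base already provides character-counting methods
p = first(s, 2)        # "cã"
q = last(s, 2)         # "ão"
```

Each call still walks the string from the start, so the cost grows with the position requested.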
However, realize that finding the i-th codepoint is O(i) (linear) complexity for the UTF-8 encoding or any variable-width encoding.
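If many lookups by character position are needed on the same string, one way to avoid paying the linear cost each time is to compute the valid indices once. A sketch, assuming the extra memory is acceptable:

```julia
s = "cão"

# One O(n) pass collects the byte index at which each character starts
idx = collect(eachindex(s))    # [1, 2, 4]

# Afterwards the i-th codepoint is an O(1) lookup
c2 = s[idx[2]]                 # 'ã'
tail = s[idx[2]:idx[3]]        # "ão"

# Alternatively, convert to a Vector{Char}, which is indexed by position
chars = collect(s)             # ['c', 'ã', 'o']
c3 = chars[3]                  # 'o'
```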
The real question is why you want the i-th codepoint. Usually, random positions in strings arise from other processing, e.g. searches, in which the index is already computed as a byproduct.
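To illustrate indices arising as a byproduct of searching, here is a short sketch with Base search functions; the values they return are already valid string indices, so no character counting is needed:

```julia
s = "cão"

# Searching for a character returns a valid byte index directly
i = findfirst(isequal('ã'), s)     # 2
found = s[i]                       # 'ã'

# Searching for a substring returns a range of valid indices
r = findfirst("ão", s)             # 2:4
matched = s[r]                     # "ão"

# Move between neighboring characters with nextind / prevind
after = s[nextind(s, i)]           # 'o'
before = s[prevind(s, i)]          # 'c'
```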
As @johnmyleswhite alluded to, the notion of a “character” in Unicode might not be what you expect. The strings s = "cão" and s2 = "ca\u0303o" may look the same when displayed, and are canonically equivalent, but s2 actually has 4 Unicode codepoints (“characters”) even though it has 3 graphemes (what most users would consider “characters”), because in s2 the ã is made from an ASCII a followed by a U+0303 “combining tilde”. So, thinking in terms of the i-th “position” in a string may indicate a conceptual misunderstanding of Unicode.
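The difference between the two strings can be checked directly. A minimal sketch using the `Unicode` standard library; the strings are written with escapes so the two encodings are unambiguous, and `nth_grapheme` is a hypothetical helper name:

```julia
using Unicode

s  = "c\u00e3o"     # precomposed ã (U+00E3)
s2 = "ca\u0303o"    # 'a' followed by U+0303 combining tilde

same = (s == s2)                                   # false: different codepoints
n_code = (length(s), length(s2))                   # (3, 4) codepoints
n_graph = (length(graphemes(s)), length(graphemes(s2)))   # (3, 3) graphemes

# NFC normalization composes s2 into the same codepoints as s
same_nfc = (Unicode.normalize(s2, :NFC) == s)      # true

# Hypothetical helper: i-th grapheme without collecting them all (still O(i))
nth_grapheme(str, i) = first(Iterators.drop(graphemes(str), i - 1))
g = nth_grapheme(s2, 2)                            # "ã", made of two codepoints
```

Normalizing input once, before comparing or indexing, is one way to keep strings like these consistent.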