String getindex problem?

Why does Julia have this problem with accented characters?
How do I deal with it?

julia> length("cão")
3

julia> "cão"[3]
ERROR: StringIndexError("cão", 3)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex_continued(::String, ::Int64, ::UInt32) at ./strings/string.jl:220
 [3] getindex(::String, ::Int64) at ./strings/string.jl:213
 [4] top-level scope at REPL[2]:1

Even weirder:

julia> "cão"[4]
'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)

Thanks.

See Strings · The Julia Language
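
For context, here is a minimal sketch of what that manual chapter describes: String indices are byte (code unit) offsets into the UTF-8 encoding, and only an offset where a character starts is a valid index. The snippet uses only Base functions.

```julia
s = "cão"

ncodeunits(s)          # 4 bytes: 'ã' (U+00E3) takes two bytes in UTF-8
length(s)              # 3 characters
collect(eachindex(s))  # valid indices: [1, 2, 4]
isvalid(s, 3)          # false: byte 3 is the second byte of 'ã'
nextind(s, 2)          # 4: the index of the character after 'ã'
s[nextind(s, 2)]       # 'o'
```

This is why "cão"[3] throws a StringIndexError while "cão"[4] returns 'o'.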

Hi @fredrikekre ,

So, if I really want the i-th character of a UTF-8 string, like this:

julia> s = "cão"
"cão"

julia> val(s, 2)
'ã': Unicode U+00E3 (category Ll: Letter, lowercase)

julia> sub(s, 1, 2)
"cã"

julia> sub(s, 1:2)
"cã"

julia> sub(s, 2, 3)
"ão"

julia> sub(s, 2:3)
"ão"

I need something like this:

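# Byte index of the i-th character of s, counting from the character at byte index k.
function ind(s::String, i::Int, k::Int = 1)
    for _ = 1:i - 1
        k = nextind(s, k)
    end
    return k
end

# Byte range covering characters i through f of s.
function interval(s::String, i::Int, f::Int)
    start = ind(s, i)
    stop = ind(s, f - i + 1, start)
    return start:stop
end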

sub(s::String, i::Int, f::Int) =
    s[interval(s, i, f)]

sub(s::String, i::UnitRange) =
    s[interval(s, first(i), last(i))]

val(s::String, i::Int) =
    s[ind(s, i)]

Or is there another way to do it?

Thanks.

Assuming your mental model is that strings are made of graphemes, you can do:

julia> import Unicode: graphemes

julia> s = "cão"
"cão"

julia> collect(graphemes(s))[1]
"c"

julia> collect(graphemes(s))[2]
"ã"

julia> collect(graphemes(s))[3]
"o"
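
If several graphemes are needed, one option is to collect them once and reuse the result. This is only a sketch; the variable name g is an illustration, not part of the Unicode API.

```julia
using Unicode: graphemes

s = "cão"
g = collect(graphemes(s))  # one pass; the elements are substrings of s

g[2]           # "ã"
join(g[2:3])   # "ão"
length(g)      # 3
```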

The i-th codepoint is given by s[nextind(s, 0, i)]:

julia> s = "cão"
"cão"

julia> [s[nextind(s, 0, i)] for i = 1:3]
3-element Array{Char,1}:
 'c'
 'ã'
 'o'

However, note that finding the i-th codepoint takes O(i) (linear) time in UTF-8 or any other variable-width encoding.
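
If many lookups by codepoint position really are needed, one possible workaround (a sketch, not a recommendation for every case) is to pay the linear cost once by collecting the string into a Vector{Char}, which uses four bytes per codepoint but supports constant-time indexing. For sequential access, iterating over the string avoids position arithmetic entirely.

```julia
s = "cão"

# One O(n) pass, then O(1) indexing by codepoint position.
chars = collect(s)     # Vector{Char}: ['c', 'ã', 'o']
chars[2]               # 'ã'
String(chars[2:3])     # "ão"

# Sequential access: pairs yields each valid byte index with its character.
for (i, c) in pairs(s)
    println(i, " => ", c)   # i takes the values 1, 2, 4
end
```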

The real question is why you want the i-th codepoint. Usually, random positions in strings arise from other processing, e.g. searches, in which the index is already computed as a byproduct.
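
As a sketch of that point, the search functions in Base already return valid byte indices, so the results can be used for slicing directly, without counting characters:

```julia
s = "cão"

i = findfirst(isequal('ã'), s)  # 2, the byte index where 'ã' starts
r = findfirst("ão", s)          # 2:4, a range of valid byte indices
s[r]                            # "ão"
s[nextind(s, i):end]            # "o", everything after 'ã'
s[1:prevind(s, i)]              # "c", everything before 'ã'
```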

As @johnmyleswhite alluded to, the notion of a “character” in Unicode might not be what you expect. The strings s = "cão" and s2 = "ca\u0303o" (which also displays as cão) may look the same, and are canonically equivalent, but s2 actually has 4 Unicode codepoints (“characters”) even though it has 3 graphemes (what most users would consider “characters”), because in s2 the ã is made from an ASCII a followed by a U+0303 “combining tilde”. So, thinking in terms of the i-th “position” in a string may indicate a conceptual misunderstanding of Unicode.
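
The difference can be checked directly. In this sketch both strings are written with explicit escapes so the two forms cannot be confused by an editor or browser; Unicode.normalize comes from the Unicode standard library.

```julia
using Unicode

s  = "c\u00e3o"    # precomposed 'ã' (U+00E3)
s2 = "ca\u0303o"   # 'a' followed by U+0303 combining tilde

s == s2                              # false: different codepoint sequences
length(s), length(s2)                # (3, 4) codepoints
length(Unicode.graphemes(s2))        # 3 graphemes
Unicode.normalize(s2, :NFC) == s     # true: NFC composes 'a' + U+0303 into 'ã'
```

Normalizing both inputs to the same form (for example NFC) before comparing or indexing is one way to make such strings behave consistently.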
