String getindex problem?

The i-th codepoint is given by s[nextind(s, 0, i)]:

julia> s = "cão"
"cão"

julia> [s[nextind(s, 0, i)] for i = 1:3]
3-element Array{Char,1}:
 'c'
 'ã'
 'o'

However, keep in mind that finding the i-th codepoint takes O(i) (linear) time in UTF-8, or in any other variable-width encoding.
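To make the cost concrete, here is a minimal sketch comparing positional lookup with plain iteration. The helper names `by_position` and `by_iteration` are made up for this illustration; only `nextind`, `length`, and `eachindex` come from Base.

```julia
s = "cão"

# Hypothetical helper: looks up every codepoint by its position.
# Each nextind(s, 0, i) call walks from the start of the string,
# so collecting all n codepoints this way costs O(n^2) in total.
by_position(s) = [s[nextind(s, 0, i)] for i in 1:length(s)]

# Hypothetical helper: iterating the string visits each codepoint once,
# so the whole pass is O(n).
by_iteration(s) = [c for c in s]

# eachindex yields the valid byte indices, in case you need them
# alongside the characters.
indices = collect(eachindex(s))
```

Both helpers return the same characters. If you need every codepoint, iterating is the cheaper way to get them.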

The real question is why you want the i-th codepoint. Usually, random positions in strings arise from other processing, e.g. searches, in which the index is already computed as a byproduct.
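As a small sketch of what "index as a byproduct" can look like, assuming the position you care about comes from a search: `findfirst` returns a valid byte index, which can be used directly and advanced with `nextind`, without counting codepoints.

```julia
s = "cão"

# The search returns the byte index of the match, not a codepoint count.
i = findfirst(isequal('ã'), s)

# That index can be used directly for O(1) access...
c = s[i]

# ...and nextind steps to the following valid index, skipping the
# second byte of the two-byte 'ã'.
j = nextind(s, i)
rest = s[j:end]
```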

As @johnmyleswhite alluded to, the notion of a “character” in Unicode might not be what you expect. The strings s = "cão" and s2 = "cão" may look the same, and are canonically equivalent, but s2 actually has 4 Unicode codepoints (“characters”) even though it has 3 graphemes (what most users would consider “characters”), because in s2 the ã is made from an ASCII a followed by a U+0303 “combining tilde”. So, thinking in terms of the i-th “position” in a string may indicate a conceptual misunderstanding of Unicode.
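The difference can be checked with the `Unicode` standard library. This is a sketch that writes both strings with explicit escapes, so the two forms cannot be confused by copy and paste:

```julia
using Unicode

s  = "c\u00e3o"    # precomposed ã (U+00E3)
s2 = "ca\u0303o"   # ASCII a followed by U+0303 combining tilde

n_codepoints_s  = length(s)              # codepoints in the precomposed form
n_codepoints_s2 = length(s2)             # codepoints in the decomposed form
n_graphemes_s2  = length(graphemes(s2))  # user-perceived characters

# The strings are not equal codepoint by codepoint, but they become
# equal once s2 is normalized to the composed form (NFC).
same_raw        = s == s2
same_normalized = Unicode.normalize(s2, :NFC) == s
```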
