String getindex problem?

stevengj · June 20, 2020, 1:37am

The i-th codepoint is given by s[nextind(s, 0, i)]:

julia> s = "cão"
"cão"

julia> [s[nextind(s, 0, i)] for i = 1:3]
3-element Array{Char,1}:
 'c'
 'ã'
 'o'

Note, however, that finding the i-th codepoint this way has O(i) (linear) complexity for UTF-8 or any other variable-width encoding, because the string has to be scanned from the start.
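To make the cost concrete, here is a minimal sketch (the variable names `idx`, `chars`, and `second` are only illustrative). It assumes the precomposed form of ã, written with an explicit escape so the bytes are unambiguous. It shows that the valid indices are byte offsets, and two common ways to avoid calling `nextind(s, 0, i)` repeatedly: iterate with `eachindex`, or pay the linear cost once with `collect` if you really need repeated random access.

```julia
s = "c\u00e3o"                 # "cão" with precomposed ã (U+00E3, two bytes in UTF-8)

# Valid String indices are byte offsets, not codepoint counts.
idx = collect(eachindex(s))    # [1, 2, 4]

# Sequential access: visit every codepoint once, O(n) in total.
for i in eachindex(s)
    println(i, " => ", s[i])
end

# Repeated random access: scan once, then index a Vector{Char} in O(1).
chars = collect(s)
second = chars[2]              # same character as s[nextind(s, 0, 2)]
```

The `collect` approach trades memory (4 bytes per `Char`) for constant-time indexing, so it probably only pays off when you index by codepoint many times.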

The real question is why you want the i-th codepoint. Usually, random positions in strings arise from other processing, e.g. searches, in which the index is already computed as a byproduct.
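As a rough illustration of that point (the example string and the names `i`, `before`, `rest`, and `r` are my own), a search already hands back a valid byte index or range, which can be used directly for indexing and slicing without ever counting codepoints:

```julia
s = "c\u00e3o"                       # "cão" with precomposed ã

# Searching for a character returns its byte index (or nothing).
i = findfirst(isequal('o'), s)       # 4, not 3

# Use prevind/nextind to step relative to a known-valid index.
before = s[1:prevind(s, i)]          # everything before the match
rest   = s[i:end]                    # the match and everything after

# Searching for a substring returns a range of valid indices.
r = findfirst("\u00e3o", s)          # 2:4
matched = s[r]
```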

As @johnmyleswhite alluded to, the notion of a “character” in Unicode might not be what you expect. The strings s = "cão" and s2 = "cão" may look the same, and are canonically equivalent, but s2 actually has 4 Unicode codepoints (“characters”) even though it has 3 graphemes (what most users would consider “characters”), because in s2 the ã is made from an ASCII a followed by a U+0303 “combining tilde”. So, thinking in terms of the i-th “position” in a string may indicate a conceptual misunderstanding of Unicode.
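A small sketch of the difference, using the `Unicode` standard library. The two strings are spelled with explicit escapes here (my assumption about which forms are meant), since the precomposed and decomposed versions are otherwise indistinguishable on screen:

```julia
using Unicode

s  = "c\u00e3o"       # precomposed ã (U+00E3)
s2 = "ca\u0303o"      # ASCII 'a' followed by U+0303 combining tilde

n_codepoints  = length(s)                   # 3
n_codepoints2 = length(s2)                  # 4
n_graphemes2  = length(graphemes(s2))       # 3

# The strings compare unequal codepoint-by-codepoint...
same_raw = (s == s2)                        # false

# ...but become equal after normalizing to a common form such as NFC.
same_nfc = (Unicode.normalize(s2, :NFC) == s)   # true
```

If user-perceived characters are what you actually want, iterating over `graphemes(s)` is likely closer to the intent than any codepoint index.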
