Substring function?

Not efficiently.

The basic argument is that a variable-width encoding with non-consecutive codepoint indices is a good tradeoff to make (memory efficiency + speed, at the cost of less-intuitive indexing) because “give me the m-th codepoint” or “give me the substring from codepoints m to n” is extremely uncommon in (correct) string-handling code, as opposed to “give me the substring at opaque indices I found in a previous search/loop”.
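To make the contrast concrete, here is a rough sketch in Rust, chosen only because its strings are also UTF-8 with byte-offset indices; the language and both function names are my assumptions for illustration, not anything from this thread. The first function slices at an index that came out of a search, which is cheap. The second has to walk the string from the start to turn codepoint counts into byte offsets.

```rust
/// Substring after the first occurrence of `needle`.
/// The index comes from the search itself, so no codepoint counting is needed.
fn after_match<'a>(s: &'a str, needle: &str) -> Option<&'a str> {
    // `find` returns a byte offset that always lies on a character boundary.
    let start = s.find(needle)?;
    Some(&s[start + needle.len()..])
}

/// Substring covering codepoints m..n (0-based, half-open).
/// This must scan from the start of the string, so it is O(length), not O(1).
fn substring_by_codepoints(s: &str, m: usize, n: usize) -> Option<&str> {
    if m > n {
        return None;
    }
    // Byte offset of every codepoint, followed by the end-of-string offset.
    let mut offsets = s
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()));
    let start = offsets.nth(m)?;
    let end = if n == m { start } else { offsets.nth(n - m - 1)? };
    s.get(start..end)
}

fn main() {
    let s = "naïve café: ok";
    println!("{:?}", after_match(s, "é: "));
    println!("{:?}", substring_by_codepoints(s, 6, 10));
}
```

Note that the second function is perfectly writable; it just costs a linear scan on every call, which is the "not efficiently" part.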

That’s why, in my previous post, I asked you where your m and n indices come from. You still haven’t given any use case for your substring(s, m, n) function. In what realistic application would someone say “give me the 12th to the 17th characters of this string”, where the numbers 12 and 17 just fell out of the sky (not from a previous search/iteration on the string)?

(One option we’ve discussed is literally making string indices an opaque type, so that it no longer resembles a consecutive array index, which is the main source of confusion.)

With UTF-16 as in Java, you have exactly the same issue, except that bugs are harder to catch because surrogate pairs are less common. With UTF-8, you catch the indexing bugs in your code the first time anyone passes you a Unicode string.
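A small sketch of that difference, again in Rust as an assumed illustration language with hypothetical helper names: code that treats a byte offset as a character position passes on ASCII, fails on the first accented letter in UTF-8, and would keep passing in UTF-16 until a character outside the Basic Multilingual Plane shows up.

```rust
/// Naive "first n characters" that wrongly treats n as a byte offset.
/// Returns None when the offset falls inside a multi-byte character.
fn naive_prefix(s: &str, n: usize) -> Option<&str> {
    s.get(0..n)
}

/// Number of UTF-16 code units needed to encode `s`.
fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

fn main() {
    // Works on ASCII, so the bug is invisible in ASCII-only tests.
    assert_eq!(naive_prefix("hello", 2), Some("he"));

    // Fails on the first non-ASCII input: offset 2 is in the middle of 'é'.
    assert_eq!(naive_prefix("héllo", 2), None);

    // In UTF-16 the same text is one code unit per character,
    // so the equivalent bug would go unnoticed here.
    assert_eq!(utf16_units("héllo"), 5);

    // Only a supplementary-plane character (a surrogate pair) exposes it.
    assert_eq!("😀".chars().count(), 1);
    assert_eq!(utf16_units("😀"), 2);
}
```

With plain slice syntax (`&s[0..2]`) instead of `get`, Rust panics on the accented input rather than returning `None`, which is the "caught the first time" behavior described above.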
