Substring function?

As far as I understand, the reason for this is to be able to support invalid UTF-8 in String as well. It’s quite common to have some corrupted data that’s treated as a string. It’s usually seen as a strength to be able to do that - I wouldn’t call it an “idiosyncrasy”.
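
To illustrate (a minimal example): a String can hold bytes that are not valid UTF-8 and still be iterated:

julia> s = "abc\xffdef";       # \xff is not valid UTF-8

julia> isvalid(s)
false

julia> length(s)               # iteration still works; the bad byte becomes one (malformed) Char
7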

You may also be interested in some of these previous discussions about various parts of the String type in Julia:

Java is using UTF-16, right? The same problems mentioned in the two links above should apply as well, since it’s a variable-length encoding like UTF-8. I don’t know how Java would treat those bad encodings, though. I think Java works around this problem by just not letting strings decompose into an iterator of char easily, which can get quite hairy to implement in a performant way (I can’t find the links to previous discussions about that though, sorry).

This thread is starting to diverge somewhat, but I’m game…

Java chars are 16-bit, and Strings are encoded internally with UTF-16, but the String class is (unsurprisingly) implemented as an array of bytes. Before 32-bit characters became a thing, that meant you could do a lot of stuff in O(1). Now, I think they have the same problem as Julia. From the javadoc:

Index values refer to `char` code units, so a supplementary
character uses two positions in a `String`.

Which means, as you said, they have the same problem I posted here, except the indexing is every 16 bits instead of 8:

jshell> "🩢🩣🩤".substring(0,1)
$119 ==> "?"

jshell> "🩢🩣🩤".substring(0,2)
$120 ==> "🩢"

jshell> "🩢🩣🩤".length()
$122 ==> 6

This is worse than I thought… Julia’s functions which work on codepoints get it right:

julia> length("🩢🩣🩤")
3

And some of the newer Java functions work with variable-length encoding:

jshell> "🩢🩣🩤".codePointCount(0,6)
$137 ==> 3

jshell> "🩢🩣🩤".codePoints().forEach(System.out::println)
129634
129635
129636

jshell> "🩢🩣🩤".codePoints().forEach(c -> System.out.println(Character.toString(c)))
🩢
🩣
🩤

Basically, it is a mix of new code that works on codepoints and old code that works on 16-bit indices. A total mess, in other words. I think Julia can do better if it is consistent about using codepoints or byte indices. I can see that, like Java, you won’t give up O(1) indexing or break backwards compatibility to do this, though.

Why not? Idiosyncrasy just means “a peculiarity of constitution or temperament: an individualizing characteristic or quality” (Idiosyncrasy Definition & Meaning - Merriam-Webster). If Julia is different from most languages in this respect, then the word fits.


Because it hardly is an individualizing characteristic: lots of languages have indexing into their strings that’s not based on codepoints. I also remember some discussion a while back about indexing based only on graphemes, which would come with a whole different set of problems.

I think the idiosyncrasy mentioned is the mix of both?

Not efficiently.

The basic argument is that a variable-width encoding with non-consecutive codepoint indices is a good tradeoff to make (memory efficiency + speed, at the cost of less-intuitive indexing) because “give me the m-th codepoint” or “give me the substring from codepoints m to n” is extremely uncommon in (correct) string-handling code, as opposed to “give me the substring at opaque indices I found in a previous search/loop”.

That’s why, in my previous post, I asked you where your m and n indices come from. You still haven’t given any use case for your substring(s, m, n) function. In what realistic application would someone say “give me the 12th to the 17th characters of this string”, where the numbers 12 and 17 just fell out of the sky (not from a previous search/iteration on the string)?
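
To illustrate the pattern with a minimal sketch: the indices come from a search, and slicing with them just works, with no codepoint counting anywhere:

julia> s = "α=1, β=2";

julia> r = findfirst("β=", s)      # the search returns string indices
7:9

julia> s[nextind(s, last(r)):end]  # slice directly with those indices
"2"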

(One option we’ve discussed is literally making string indices an opaque type, so that it no longer resembles a consecutive array index, which is the main source of confusion.)

With UTF-16 as in Java, you have exactly the same issue, except that bugs are harder to catch because surrogate pairs are less common. With UTF-8, you catch the indexing bugs in your code the first time anyone passes you a Unicode string.


I am parsing Unicode text in columns, like Unix cut.

When you parse, you are looping over the string or using findnext or similar. Hence you can use the actual string index from that loop/search to subsequently extract substrings — you neither need nor want the codepoint count.

(If you are parsing with ASCII delimiters, you often don’t even need to look at Unicode characters…you literally don’t care what comes between the delimiters when parsing. In this case you can alternatively loop over the raw codeunits/byte array, provided by the codeunits(str) function. This still gives you indices that you can use to slice the original string. That’s what e.g. JSON parsing does IIRC.)
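
Here is a minimal sketch of that codeunits approach, assuming an ASCII comma delimiter (field_ranges is a hypothetical name, not a Base function):

# Scan the raw bytes for an ASCII delimiter. For a String, code-unit
# indices coincide with the indices used to slice the string itself.
function field_ranges(s::String, delim::UInt8 = UInt8(','))
    bytes = codeunits(s)
    ranges = UnitRange{Int}[]
    start = 1
    for i in eachindex(bytes)
        if bytes[i] == delim
            push!(ranges, start:prevind(s, i))  # field ends at the char before the delimiter
            start = i + 1                       # ASCII delimiter, so the next char starts at i + 1
        end
    end
    push!(ranges, start:lastindex(s))           # the final field
    return ranges
end

fields = [SubString(s, r) for r in field_ranges(s)]  # s = "ü,é,â,ỳ" gives ["ü", "é", "â", "ỳ"]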


No, I didn’t get that far, as I am just experimenting with the language. I imagine the final implementation would use regexp matching for parsing, as the text also contains tags. But I wanted to try a few things out first… like, hey, let’s just cut out this field and sum it up. My substring function works fine; I was just curious about why you have chop instead.

Yes, that’s the problem.

You were probably trying to test things out by coming up with your indices “visually” on some test strings, and noticed that the indices did not coincide with your visual expectation for Unicode strings. But this would be a problem even if Julia used vectors of codepoints, as my "ü" example showed.

For example, if I gave you the string "ü,é,â,ỳ" and asked you to cut out the 3rd and 4th comma-separated fields, analogous to cut -d, -f3-4, what “characters” (codepoints) do you think that corresponds to? Probably you would guess “the 5th to 7th characters”. But no, look what your substring function returns:

julia> substring("ü,é,â,ỳ", 5,7)
"́,a"

Whoops, is your substring function buggy? No, it’s just that “codepoints” in Unicode don’t necessarily correspond to what a human reader thinks of as a “character”.

Whereas if you actually implemented a cut function, you would do a sequence of searches for the delimiter (e.g. with findnext), which would yield a sequence of string indices (≠ codepoint counts), slicing would work just fine, and it would be absolutely irrelevant how many codepoints occurred between one delimiter and the next.
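
Something like this hedged sketch, for instance (cutfields is a hypothetical name, with error handling omitted):

# Extract comma-separated fields m through n, cut-style. All slicing uses
# string indices obtained from findnext; codepoint counts never appear.
function cutfields(s::AbstractString, m::Integer, n::Integer; delim::Char = ',')
    lo = firstindex(s)
    for _ in 1:m-1                     # skip m-1 delimiters to reach field m
        i = findnext(delim, s, lo)
        i === nothing && return ""     # fewer than m fields
        lo = nextind(s, i)
    end
    pos, count = lo, m
    while true                         # scan forward to the delimiter ending field n
        i = findnext(delim, s, pos)
        i === nothing && return s[lo:end]        # field n runs to the end of the string
        count == n && return s[lo:prevind(s, i)]
        pos = nextind(s, i)
        count += 1
    end
end

cutfields("ü,é,â,ỳ", 3, 4)  # "â,ỳ"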

But because you hadn’t gotten that far, you jumped to the conclusion that Julia’s string handling is broken and we are missing extremely basic functionality like extracting substrings.

(The UTF-8 encoding that Julia employs is not unusual! It’s taking over most of the internet, it’s used in other modern languages like Go and Swift, and it’s been the subject of many, many discussions and revisions in Julia itself. This is not something we picked out of a hat because we hadn’t thought through basic functionality.)


Hey, thanks for the reply. I didn’t mean to criticise Julia, or suggest that Julia’s unicode implementation is broken.

julia> SubString("äöü",2,3)
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'ä', [3]=>'ö'

Reading this discussion, I see the reasons. I also know about the Unicode/UTF-8/… difficulties.
But…

I would prefer having just

julia> SubString("äöü",2,3)
"öü"

even if it would not be efficient.
(I’m not distinguishing between a substring function and a view-like SubString here; it doesn’t matter for me.)

I don’t think expecting working substring functionality is that uncommon, or even wrong, as in “if you are using substring, you are probably doing it wrong”.

In my opinion, the best reasoning doesn’t help if people’s expectations are valid. Expecting some kind of substring function is valid, in my opinion, for any higher-level programming language (of course not for assembler).

Julia has working substring functionality, which works on string indices.

The proposed character-indexing method does not eliminate the complexities of Unicode. Consider:

julia> substring(str, start, stop) = str[nextind(str, 0, start):nextind(str, 0, stop)]

julia> substring("äöü", 2,3) # NFC normalized string
"öü"

julia> substring("äöü", 2,3) # NFD normalized string
"̈o"

The supposed simplicity of “character indexing” is an illusion.

You pay a big price in performance to reduce apparent confusion on people’s first few days of using strings in Julia, but only postpone your Unicode bugs (because “characters” don’t mean what you think), and you get zero benefits in the long run (because indexing codepoint counts is not actually necessary for realistic string processing).


A significant difference from UTF-8 is that there is no such thing as malformed UTF-16, only unpaired surrogate code units, which still have code points, just not ones that correspond to valid Unicode characters. So every UTF-16 string can be iterated as code points, you might just get the code point for an unpaired surrogate if the string is invalid. Compare that with UTF-8 where some byte sequences just don’t follow the right structure at all.

(Another way to put this is that every UTF-16 sequence, valid or invalid, can be represented as a sequence of code points using WTF-8, which is an extension of UTF-8 allowing surrogate pair code points.)
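
In Julia terms, for example:

julia> isvalid(Char, 0xD800)   # an unpaired surrogate code point is not a valid Char
false

julia> isvalid(Char, 0x1F44D)  # any non-surrogate scalar value is
true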


We could certainly add a graphemes(str, m, n) method that returns a substring of the m-th to n-th graphemes, which is close to a user-perceived character slice.

But is this actual useful functionality in any real application? (When was the last time you used the graphemes(str) iterator, for that matter? At least grapheme iteration has some practical uses, e.g. cursor movement, but random grapheme access seems basically useful only for string demos.)

As far as I can tell, its sole practical utility would be to end discussions like this one. (Which might be worth a ≈25-line function in the Unicode stdlib, I suppose.)

https://github.com/JuliaLang/julia/pull/44266
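
A rough sketch of what such a method could look like, built on the existing graphemes iterator (grapheme_slice is a hypothetical name):

using Unicode

# Each element of graphemes(s) is a SubString of s, so joining a
# contiguous run of them reconstructs the corresponding text.
function grapheme_slice(s::AbstractString, m::Integer, n::Integer)
    gs = collect(graphemes(s))
    return join(gs[m:n])
end

grapheme_slice("ü,é,â,ỳ", 5, 7)  # "â,ỳ", regardless of NFC/NFD normalization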


To Steve:

using Unicode
graphemes(('\U1F44D'*'\U0352')^100)

Isn’t grapheme slicing the way to generate an icon like this?

[image]

No, you just need to call first on the graphemes iterator to get the “first letter”.

In contrast, general slicing means (simulated) random access. No one has so far given any practical use-case for this in which you don’t already have (or can easily get) string indices.
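
For example:

julia> using Unicode

julia> s = ('\U1F44D' * '\U0352')^100;

julia> first(graphemes(s))   # one grapheme: 👍 plus its combining mark
"👍͒"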


But then, what is the real application of byte-indexing a substring? It looks pretty low-level to me. I understand you might need such indexing at the lowest level, manipulating bytes when encoding/decoding, but I suppose this could also be done by converting to Vector{UInt8} or something similar.

And sure, there is the NFC/NFD normalization choice that users of substring/graphemes should be aware of. In my real applications I sometimes use NFD, when I am interested in diacritics, and otherwise NFC. And sure, I have burnt myself working with un-normalized Unicode strings. To copy from your example, "äöü" != "äöü", which is also somewhat counterintuitive.
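
For example, with the Unicode stdlib’s normalize:

julia> using Unicode

julia> nfc = Unicode.normalize("äöü", :NFC); nfd = Unicode.normalize("äöü", :NFD);

julia> nfc == nfd, length(nfc), length(nfd)   # same text, different codepoint counts
(false, 3, 6)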

I can also imagine that modern programming paradigms discourage people from doing any kind of indexing of strings, e.g. by providing the most amazing set of string manipulation and iteration functions.

Iterating over the letters (code points) of a string gives you byte indices, I believe (just not necessarily consecutive ones), so at some (lower-level, yes) point you need byte indexing, e.g. for SubStrings. I would have liked indexing (byte-based or otherwise) not to be defined on Strings (by default), but by now that would be a breaking change. That’s my plan for a string type I’m developing…
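
For example, pairs gives each character together with its (byte-based) string index:

julia> for (i, c) in pairs("äöü")   # NFC-normalized string
           println(i, " => ", c)
       end
1 => ä
3 => ö
5 => ü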

One other application: strings in Julia do not have to hold UTF-8 data (but you can optionally validate them as such). They could hold e.g. byte-based ISO-8859-1, any kind of binary data (even UTF-16), or malformed UTF-8. So I suppose if you know you’re dealing with byte-based data you might want byte indexing (as opposed to the provided iterator)… or every other byte for UTF-16 (for code points, not letters; recall that UTF-16 is also variable-length because of surrogates).
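
For instance, arbitrary bytes round-trip through a String, and codeunits gives direct byte access:

julia> s = String([0x41, 0xfe, 0x42])   # not valid UTF-8
"A\xfeB"

julia> codeunits(s)[2]                  # byte-based access to the raw data
0xfe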