Substring function?

Iterating over the characters (code points) of a string together with their indices, e.g. via `eachindex` or `pairs`, gives you byte indexes, I believe (just not necessarily consecutive ones), so at some (lower-level, yes) point you need byte indexing, e.g. for SubStrings. I would have liked indexing (byte-based or otherwise) not to be defined on Strings by default, but by now that would be a breaking change. Leaving it out is my plan for a string type I’m developing…
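To make the "not necessarily consecutive" part concrete, here is a small sketch. The sample string is my own choice; the functions are all from Base.

```julia
s = "αβγ abc"   # each Greek letter takes 2 bytes in UTF-8

# eachindex yields the byte index at which each character starts.
idx = collect(eachindex(s))

# pairs yields the byte index together with the character.
for (i, c) in pairs(s)
    println(i, " => ", c)
end

# A SubString is delimited by two such byte indices (both inclusive).
sub = SubString(s, 3, 5)

println(idx)   # [1, 3, 5, 7, 8, 9, 10]
println(sub)   # βγ
```

Note that `s[2]` would throw a `StringIndexError` here, since byte 2 is in the middle of `α`.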

Another application: strings in Julia do not have to hold UTF-8 data (though you can optionally validate them as such). The contents could be byte-based ISO-8859-1, any kind of binary data (even UTF-16), or malformed UTF-8. So I suppose if you know you’re dealing with a byte-based encoding you might want byte indexing (as opposed to the provided iterator)… or every other byte for UTF-16 (for code points, not letters; recall that UTF-16 is also variable-length because of surrogates).
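A rough illustration of a `String` holding non-UTF-8 bytes, assuming ISO-8859-1 input (the byte values below are just "café" in that encoding, picked as an example):

```julia
# "café" encoded as ISO-8859-1: the é is the single byte 0xe9.
latin1 = String(UInt8[0x63, 0x61, 0x66, 0xe9])

valid = isvalid(latin1)      # false: a lone 0xe9 is not well-formed UTF-8
n = ncodeunits(latin1)       # 4 bytes
last_byte = codeunit(latin1, 4)

# Byte-wise access; each ISO-8859-1 byte maps to the code point
# with the same number, so decoding is a direct conversion.
decoded = join(Char(b) for b in codeunits(latin1))

println(valid, " ", n, " ", decoded)
```

Here `codeunit` and `codeunits` work on the raw bytes regardless of whether the data is valid UTF-8.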

The reason byte indexes are useful is that they are the thing that can be looked up in O(1), so if you search for graphemes and get byte indexes back, it is easy to parse through a string quickly.
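As a sketch of that kind of parsing, here is a hand-rolled field splitter that only ever moves between byte indexes returned by a search. `split_fields` is a hypothetical helper written for illustration (Base already has `split`); it searches for a single delimiter character rather than a grapheme.

```julia
# Hypothetical helper: split `s` on `delim` using only byte indexes
# handed back by findnext, nextind and prevind.
function split_fields(s::String, delim::Char)
    fields = SubString{String}[]
    start = firstindex(s)
    while true
        hit = findnext(==(delim), s, start)   # byte index of next delimiter
        if hit === nothing
            push!(fields, SubString(s, start))  # rest of the string
            return fields
        end
        # Field ends at the character just before the delimiter.
        push!(fields, SubString(s, start, prevind(s, hit)))
        start = nextind(s, hit)               # resume just after the delimiter
    end
end

fields = split_fields("α,βγ,,δ", ',')
println(fields)
```

No step here needs to count characters from the start of the string; every jump goes straight to a saved byte index.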


In particular, you should think of the byte (codeunit) index as simply an opaque “pointer” into the string, which is returned by iteration, searching, etc. The user doesn’t need to understand how it relates to the encoding. The important thing is that once you have this opaque pointer, you can jump to that position quickly (O(1)).

Put another way, why have indices at all? The reason for them is to save locations in a string. But you don’t necessarily need to know what those locations “mean” as long as you can do various things with them (e.g. reading at a location, comparing two locations by ≤ or <, grabbing a substring between two locations, …).
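For instance, the operations listed above look roughly like this in practice (the sample string is arbitrary; nothing here depends on knowing the actual index values):

```julia
s = "naïve café: très bon"

r = findfirst("café", s)   # a range of byte indices, or `nothing`
i, j = first(r), last(r)   # two saved locations

c = s[i]                   # read at a location
ordered = i < j            # compare two locations
word = SubString(s, i, j)  # grab the substring between them
k = nextind(s, j)          # step to the location of the next character

println(word, " is followed by ", s[k])
```

The indices come out of `findfirst` and `nextind` and go straight back into indexing and `SubString`, without ever being interpreted.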

There is a common misconception lurking here: normalization is not sufficient to merge all graphemes into single characters. Unicode has “merged” (precomposed) characters that include the accent only for a small subset of characters and diacritical marks; everything else will use a combining character even in NFC.

For example, the string "x̂" contains only one grapheme (one “user-visible” character) but is two characters (two Unicode codepoints) in any normalization.
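You can check this with the `Unicode` standard library. A quick sketch, with `e` plus a combining acute accent added as a contrasting case that does have a precomposed form:

```julia
using Unicode

x = "x\u0302"   # 'x' followed by U+0302 COMBINING CIRCUMFLEX ACCENT
e = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

x_chars     = length(x)                            # code points as written
x_nfc_chars = length(Unicode.normalize(x, :NFC))   # code points after NFC
x_graphemes = length(collect(Unicode.graphemes(x)))

e_nfc_chars = length(Unicode.normalize(e, :NFC))   # é exists precomposed (U+00E9)

println((x_chars, x_nfc_chars, x_graphemes, e_nfc_chars))   # (2, 2, 1, 1)
```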
