Substring function?

Why not? Idiosyncrasy just means “a peculiarity of constitution or temperament: an individualizing characteristic or quality” (Idiosyncrasy Definition & Meaning - Merriam-Webster); if Julia is different in this respect from most languages, then the word fits.

1 Like

Because it is hardly an individualizing characteristic - lots of languages have indexing into their strings that’s not based on codepoints. I also remember a discussion some time back about indexing based only on graphemes, which would come with a whole different set of problems.

I think the idiosyncrasy mentioned is the mix of both?

Not efficiently.

The basic argument is that a variable-width encoding with non-consecutive codepoint indices is a good tradeoff to make (memory efficiency + speed, at the cost of less-intuitive indexing) because “give me the m-th codepoint” or “give me the substring from codepoints m to n” is extremely uncommon in (correct) string-handling code, as opposed to “give me the substring at opaque indices I found in a previous search/loop”.
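
For instance, a minimal sketch of the “opaque indices” workflow (the string and delimiter here are made up for illustration):

```julia
s = "αβ,γδ"                      # non-ASCII, so codepoint counts ≠ byte positions
i = findfirst(',', s)            # byte index of the delimiter (5 here, not 3)
before = s[1:prevind(s, i)]      # slice using string indices, not codepoint counts
after  = s[nextind(s, i):end]
# before == "αβ", after == "γδ"
```

The indices `i`, `prevind(s, i)`, and `nextind(s, i)` came from a search, so the code never needed to count codepoints at all.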

That’s why, in my previous post, I asked you where your m and n indices come from. You still haven’t given any use case for your substring(s, m, n) function. In what realistic application would someone say “give me the 12th to the 17th characters of this string”, where the numbers 12 and 17 just fell out of the sky (not from a previous search/iteration on the string)?

(One option we’ve discussed is literally making string indices an opaque type, so that it no longer resembles a consecutive array index, which is the main source of confusion.)

With UTF-16 as in Java, you have exactly the same issue, except that bugs are harder to catch because surrogate pairs are less common. With UTF-8, you catch the indexing bugs in your code the first time anyone passes you a Unicode string.

5 Likes

I am parsing Unicode text in columns. Like Unix cut.

When you parse, you are looping over the string or using findnext or similar. Hence you can use the actual string index from that loop/search to subsequently extract substrings — you neither need nor want the codepoint count.

(If you are parsing with ASCII delimiters, you often don’t even need to look at Unicode characters…you literally don’t care what comes between the delimiters when parsing. In this case you can alternatively loop over the raw codeunits/byte array, provided by the codeunits(str) function. This still gives you indices that you can use to slice the original string. That’s what e.g. JSON parsing does IIRC.)
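
A sketch of that byte-level approach (the string here is my own example; since the delimiter `,` is ASCII, its byte value `0x2c` can never appear inside a multi-byte UTF-8 character):

```julia
s = "αβ,γδ,ε"
cu = codeunits(s)                     # raw bytes; ASCII ',' is always the byte 0x2c
commas = [i for i in eachindex(cu) if cu[i] == UInt8(',')]
# the byte positions of the delimiters work directly as string indices for slicing:
fields = [s[a:prevind(s, b)] for (a, b) in
          zip([1; commas .+ 1], [commas; ncodeunits(s) + 1])]
# fields == ["αβ", "γδ", "ε"]
```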

3 Likes

No, I didn’t get that far, as I am just experimenting with the language. I imagine the final implementation would use regexp matching for parsing, as the text also contains tags. But I wanted to try a few things out first… like, hey, let’s just cut out this field and sum it up. My substring function works fine; I was just curious about why you have chop instead.

Yes, that’s the problem.

You were probably trying to test things out by coming up with your indices “visually” on some test strings, and noticed that the indices did not coincide with your visual expectation for Unicode strings. But this would be a problem even if Julia used vectors of codepoints, as my "ü" example showed.

For example, if I gave you the string "ü,é,â,ỳ" and asked you to cut out the 3rd and 4th comma-separated fields, analogous to cut -d, -f3-4, what “characters” (codepoints) do you think that corresponds to? Probably you would guess “the 5th to 7th characters”. But no, look what your substring function returns:

julia> substring("ü,é,â,ỳ", 5,7)
"́,a"

Whoops, is your substring function buggy? No, it’s just that “codepoints” in Unicode don’t necessarily correspond to what a human reader thinks of as a “character”.

Whereas if you actually implemented a cut function, you would do a sequence of searches for the delimiter (e.g. with findnext), which would yield a sequence of string indices (≠ codepoint counts), slicing would work just fine, and it would be absolutely irrelevant how many codepoints occurred between one delimiter and the next.
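
For instance, a minimal sketch of such a cut (the name cutfields and its signature are made up for illustration, not an existing API):

```julia
# Split on a delimiter by repeated findnext searches, collecting string
# indices; codepoint counts are never needed.
function cutfields(s::AbstractString, delim::Char, range::UnitRange{Int})
    start = firstindex(s)
    fields = String[]
    while true
        i = findnext(delim, s, start)
        push!(fields, i === nothing ? s[start:end] : s[start:prevind(s, i)])
        i === nothing && break
        start = nextind(s, i)
    end
    return join(fields[range], delim)
end

cutfields("ü,é,â,ỳ", ',', 3:4)   # == "â,ỳ"
```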

But because you hadn’t gotten that far, you jumped to the conclusion that Julia’s string handling is broken and we are missing extremely basic functionality like extracting substrings.

(The UTF-8 encoding that Julia employs is not unusual! It’s taking over most of the internet, it’s used in other modern languages like Go and Swift, and it’s been the subject of many, many discussions and revisions in Julia itself. This is not something we picked out of a hat because we hadn’t thought through basic functionality.)

8 Likes

Hey, thanks for the reply. I didn’t mean to criticise Julia, or suggest that Julia’s unicode implementation is broken.

1 Like
julia> SubString("äöü",2,3)
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'ä', [3]=>'ö'

Reading this discussion and I see the reasons. Also knowing about Unicode/UTF-8/… difficulties.
But…

I would prefer having just

julia> SubString("äöü",2,3)
"öü"

even if it would be not efficient.
(not discussing function substring or view-like SubString here, doesn’t matter for me)

I don’t think expecting working substring functionality is that uncommon, or even wrong, as in “you are using substring, so you are probably doing it wrong”.

In my opinion, the best reasoning doesn’t help if people validly expect the feature anyway. Expecting some kind of substring function is, in my opinion, valid for any higher-level programming language (though of course not for assembler).

Julia has a working substring functionality, which works on string indices.

The proposed character-indexing method does not eliminate the complexities of Unicode. Consider:

julia> substring(str, start, stop) = str[nextind(str, 0, start):nextind(str, 0, stop)]

julia> substring("äöü", 2,3) # NFC normalized string
"öü"

julia> substring("äöü", 2,3) # NFD normalized string
"̈o"

The supposed simplicity of “character indexing” is an illusion.
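
You can see the two visually identical strings diverge directly (a small sketch using the Unicode stdlib’s normalize):

```julia
using Unicode

nfc = Unicode.normalize("äöü", :NFC)   # 3 codepoints: ä ö ü (precomposed)
nfd = Unicode.normalize("äöü", :NFD)   # 6 codepoints: a + ̈, o + ̈, u + ̈
length(nfc)   # == 3
length(nfd)   # == 6
nfc == nfd    # false, although both render identically
```

So any “character-indexed” substring gives different answers depending on a normalization choice the user may not even know was made.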

You pay a big price in performance to reduce apparent confusion on people’s first few days of using strings in Julia, but only postpone your Unicode bugs (because “characters” don’t mean what you think), and you get zero benefits in the long run (because indexing codepoint counts is not actually necessary for realistic string processing).

6 Likes

A significant difference from UTF-8 is that there is no such thing as malformed UTF-16, only unpaired surrogate code units, which still have code points, just not ones that correspond to valid Unicode characters. So every UTF-16 string can be iterated as code points, you might just get the code point for an unpaired surrogate if the string is invalid. Compare that with UTF-8 where some byte sequences just don’t follow the right structure at all.

(Another way to put this is that every UTF-16 sequence, valid or invalid, can be represented as a sequence of code points using WTF-8, which is an extension of UTF-8 allowing surrogate pair code points.)

4 Likes

We could certainly add a graphemes(str, m, n) method that returns a substring of the m-th to n-th graphemes, which is close to a user-perceived character slice.

But is this actual useful functionality in any real application? (When was the last time you used the graphemes(str) iterator, for that matter? At least grapheme iteration has some practical uses, e.g. cursor movement, but random grapheme access seems basically useful only for string demos.)

As far as I can tell, its sole practical utility would be to end discussions like this one. (Which might be worth a ≈25-line function in the Unicode stdlib, I suppose.)
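
A rough sketch of what such a function could look like (the name grapheme_slice is hypothetical; later Julia versions gained a graphemes(s, m:n) method in the Unicode stdlib, but this shows the idea: O(n) iteration, collecting the m-th through n-th graphemes):

```julia
using Unicode

function grapheme_slice(s::AbstractString, m::Integer, n::Integer)
    io = IOBuffer()
    for (k, g) in enumerate(graphemes(s))
        k > n && break                 # past the requested range: stop
        k >= m && write(io, g)         # inside the range: collect the grapheme
    end
    return String(take!(io))
end

grapheme_slice("x̂y", 1, 1)   # == "x̂" (one grapheme, two codepoints)
```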

https://github.com/JuliaLang/julia/pull/44266

4 Likes

To Steve:

using Unicode
graphemes(('\U1F44D'*'\U0352')^100)

Isn’t grapheme slicing the way to generate an icon like this?

[image]

No, you just need to call first on the graphemes iterator to get the “first letter”.
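
For example (a small sketch with a made-up string whose first grapheme spans two codepoints):

```julia
using Unicode

s = "x̂yz"                # first grapheme is x + combining circumflex
first(graphemes(s))      # == "x̂": the first user-perceived character
```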

In contrast, general slicing means (simulated) random access. No one has so far given any practical use-case for this in which you don’t already have (or can easily get) string indices.

5 Likes

But then, what is the real application of byte-indexing a substring? It looks pretty low-level to me—I understand you might need such indexing at the lowest level, manipulating bytes in doing the encoding/decoding, but I suppose this could also be done by a convert to Vector{UInt8} or something similar.

And sure, there is the NFC/NFD normalization choice, that users of substring/graphemes should be aware of. In my real applications I sometimes use NFD, when I am interested in diacritics, and otherwise NFC. And sure, I have burnt myself working with un-normalized Unicode strings. Copying from your example "äöü" != "äöü", which is also somewhat counter intuitive.

I can also imagine that modern programming paradigms discourage people from doing any kind of indexing of strings, e.g. by providing the most amazing set of string manipulation and iteration functions.

Iterating over the letters (code points) of a string gives you byte indices, I believe (just not necessarily consecutive ones), so at some (lower-level, yes) point you need byte indexing, e.g. for SubStrings. I would have liked indexing (byte-based or otherwise) not to be defined on Strings (by default), but by now that would be a breaking change. That’s my plan for a string type I’m developing…
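
That is easy to check (a small sketch; the string is my own example):

```julia
s = "αβc"
collect(eachindex(s))            # == [1, 3, 5]: valid byte indices, not consecutive
[i => c for (i, c) in pairs(s)]  # iteration hands back index => character pairs
s[3:end]                         # == "βc": slicing with an index from the iteration
```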

One other application is that strings in Julia do not have to hold UTF-8 data (though you can optionally validate them as such). A string could hold e.g. byte-based ISO-8859-1, any kind of binary data (even UTF-16), or malformed UTF-8. So I suppose if you know you’re dealing with a byte-based encoding you might want byte indexing (as opposed to the provided iterator)… or every-other-byte indexing for UTF-16 (for code points, not letters; recall UTF-16 is also variable-length because of surrogates).

The reason byte indices are useful is that they are the thing that can be looked up in O(1), so if you search for graphemes and get back byte indices, it is easy to parse quickly through a string.

1 Like

In particular, you should think of the byte (codeunit) index as simply an opaque “pointer” into the string which is returned by iteration, searching, etcetera. The user doesn’t need to understand how it relates to the encoding. The important thing is that once you have this opaque pointer, you can jump to that position quickly (O(1)).

Put another way, why have indices at all? The reason for them is to save locations in a string. But you don’t necessarily need to know what those locations “mean” as long as you can do various things with them (e.g. reading at a location, comparing two locations by ≤ or <, grabbing a substring between two locations, …).
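
A short sketch of treating positions as opaque (the string and characters are made up):

```julia
s = "déjà vu"
i = findfirst('j', s)      # an opaque position returned by a search
j = findfirst('v', s)
i < j                      # positions compare in string order
s[i:prevind(s, j)]         # == "jà ": the substring between the two saved locations
```

Nothing here depends on what `i` and `j` “mean” in terms of the encoding.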

There is a common misconception lurking — normalization is not sufficient to merge all graphemes into single characters. Unicode has “merged” characters that include the accent only for a small subset of characters and diacritical marks; everything else will use a combining character even in NFC.

For example, the string "x̂" contains only one grapheme (one “user-visible” character) but is two characters (two Unicode codepoints) in any normalization.
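
You can verify this with the Unicode stdlib (a small sketch; there is no precomposed x-with-circumflex codepoint, so NFC cannot merge it):

```julia
using Unicode

s = Unicode.normalize("x̂", :NFC)   # still x + U+0302 after NFC
length(s)                           # == 2 codepoints
length(graphemes(s))                # == 1 user-perceived character
```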

3 Likes