Substring function?

I think that’s getting to the bottom of the issue. Iterators use codepoints, whereas ranges use byte indexes. I see the need for byte indexing, but being new to the language, this idiosyncrasy tripped me up. I expected to be working with codepoints all the way down unless explicitly converting into bytes. For example, in Java:

jshell> "αβγł€đŧŧŋ".substring(0,5)
$1 ==> "αβγł€"
jshell> "αβγł€đŧŧŋ".getBytes()
$2 ==> byte[19] { -50, -79, -50, -78, -50, -77, -59, -126, -30, -126, -84, -60, -111, -59, -89, -59, -89, -59, -117 }
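
For comparison, here is roughly how the same distinction shows up in Julia (an illustrative sketch, not copied from a real session):

julia> join(collect("αβγł€đŧŧŋ")[1:5])   # iteration yields codepoints: the first five characters
"αβγł€"

julia> "αβγł€đŧŧŋ"[1:5]                  # range indexing uses byte indices: bytes 1:5 only reach 'γ'
"αβγ"

julia> "αβγł€đŧŧŋ"[1:2]                  # 2 is a continuation byte of 'α', not a character boundary
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'α', [3]=>'β'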

It might have been possible to implement strings as a vector of codepoints and still achieve O(1) performance for indexing using a lookup table for indices, or (to save memory) an isascii property. Too late for that though.
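
One reading of that idea, as a purely hypothetical sketch (IndexedString is made up for illustration; this is not how Base.String works): keep the UTF-8 bytes, but precompute a table mapping codepoint number to byte index.

struct IndexedString
    data::String          # the usual UTF-8 bytes
    charidx::Vector{Int}  # charidx[i] = byte index where the i-th codepoint starts
end
IndexedString(s::AbstractString) = IndexedString(String(s), collect(eachindex(s)))

# O(1) codepoint-based indexing via the lookup table
Base.getindex(s::IndexedString, i::Integer) = s.data[s.charidx[i]]
Base.getindex(s::IndexedString, r::UnitRange) =
    s.data[s.charidx[first(r)]:s.charidx[last(r)]]
Base.length(s::IndexedString) = length(s.charidx)

The table costs roughly one extra machine word per codepoint, which is presumably where the isascii shortcut would come in.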

Note, I’m not saying that Julia should be “like Java”, but I do agree that the API should be consistent. Looks to me like the LegacyStrings implementation might have been better.

As far as I understand, the reason for this is to be able to support invalid UTF-8 in String as well. It’s quite common to have some corrupted data that’s treated as a string. It’s usually seen as a strength to be able to do that; I wouldn’t call it an “idiosyncrasy”.
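
A quick illustration of that point (a sketch; display details may vary across Julia versions):

julia> s = String(UInt8[0x61, 0xff, 0x62])   # 'a', an invalid byte, 'b'
"a\xffb"

julia> isvalid(s)    # not valid UTF-8, but still a perfectly usable String
false

julia> length(s)     # iteration still works; the bad byte comes through as a malformed Char
3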

You may also be interested in some of these previous discussions about various parts of the String type in Julia:

Java is using UTF-16, right? The same problems mentioned in the two links above should apply as well, as it’s a variable-length encoding like UTF-8. I don’t know how Java would treat those bad encodings though. I think Java works around this problem by just not having strings decompose into an iterator of char easily, which can get quite hairy to implement in a performant way (I can’t find the links to previous discussions about that though, sorry).

This thread is starting to diverge somewhat, but I’m game…

Java chars are 16-bit, and Strings are encoded internally with UTF-16, but the String class is (unsurprisingly) implemented as an array of bytes. Before 32-bit characters became a thing, that meant you could do a lot of stuff in O(1). Now, I think they have the same problem as Julia. From the javadoc:

Index values refer to `char` code units, so a supplementary
character uses two positions in a `String`.

Which means, as you said, they have the same problem I posted here, except the indexing is in 16-bit units instead of bytes:

jshell> "🩢🩣🩤".substring(0,1)
$119 ==> "?"

jshell> "🩢🩣🩤".substring(0,2)
$120 ==> "🩢"

jshell> "🩢🩣🩤".length()
$122 ==> 6

This is worse than I thought… Julia’s functions which work on codepoints get it right:

julia> length("🩢🩣🩤")
3

And some of the newer Java functions work with variable-length encoding:

jshell> "🩢🩣🩤".codePointCount(0,6)
$137 ==> 3

jshell> "🩢🩣🩤".codePoints().forEach(System.out::println)
129634
129635
129636

jshell> "🩢🩣🩤".codePoints().forEach(c -> System.out.println(Character.toString(c)))
🩢
🩣
🩤

Basically it is a mix of new code which works on codepoints and old code which works on 16-bit indexes. A total mess, in other words. I think Julia can do better if it is consistent about using codepoints or byte indexes. I can see that, like Java, you won’t give up O(1) indexing or break backwards compatibility to do this though.

Why not? Idiosyncrasy just means “a peculiarity of constitution or temperament: an individualizing characteristic or quality” (Idiosyncrasy Definition & Meaning - Merriam-Webster). If Julia is different in this aspect from most languages, then the word fits.

1 Like

Because it hardly is an individualizing characteristic; lots of languages have indexing into their strings that’s not based on codepoints. I also remember some discussion some time back about only indexing based on graphemes, which would come with a whole different set of problems.

I think the idiosyncrasy mentioned is the mix of both?

Not efficiently.

The basic argument is that a variable-width encoding with non-consecutive codepoint indices is a good tradeoff to make (memory efficiency + speed, at the cost of less-intuitive indexing) because “give me the m-th codepoint” or “give me the substring from codepoints m to n” is extremely uncommon in (correct) string-handling code, as opposed to “give me the substring at opaque indices I found in a previous search/loop”.
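
For example, a small sketch of that “opaque indices from a search” pattern (the string and delimiter here are made up for illustration):

julia> s = "αβγ,ł€đ,ŧŋ"
"αβγ,ł€đ,ŧŋ"

julia> i = findfirst(',', s)                 # a byte index, produced by a search
7

julia> j = findnext(',', s, nextind(s, i))   # the next delimiter, again a byte index
15

julia> s[nextind(s, i):prevind(s, j)]        # slice the field between them; no codepoint counting
"ł€đ"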

That’s why, in my previous post, I asked you where your m and n indices come from. You still haven’t given any use case for your substring(s, m, n) function. In what realistic application would someone say “give me the 12th to the 17th characters of this string”, where the numbers 12 and 17 just fell out of the sky (not from a previous search/iteration on the string)?

(One option we’ve discussed is literally making string indices an opaque type, so that it no longer resembles a consecutive array index, which is the main source of confusion.)

With UTF-16 as in Java, you have exactly the same issue, except that bugs are harder to catch because surrogate pairs are less common. With UTF-8, you catch the indexing bugs in your code the first time anyone passes you a Unicode string.

5 Likes

I am parsing Unicode text in columns, like Unix cut.

When you parse, you are looping over the string or using findnext or similar. Hence you can use the actual string index from that loop/search to subsequently extract substrings — you neither need nor want the codepoint count.

(If you are parsing with ASCII delimiters, you often don’t even need to look at Unicode characters…you literally don’t care what comes between the delimiters when parsing. In this case you can alternatively loop over the raw codeunits/byte array, provided by the codeunits(str) function. This still gives you indices that you can use to slice the original string. That’s what e.g. JSON parsing does IIRC.)
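
For instance, a minimal sketch of that byte-level approach (fields_by_bytes is a hypothetical helper, not a Base function):

# split on an ASCII delimiter by scanning the raw code units;
# the byte positions found here are also valid indices into the original string
function fields_by_bytes(s::String, delim::UInt8 = UInt8(','))
    cu = codeunits(s)
    fields = String[]
    start = 1                       # byte index where the current field begins
    for k in eachindex(cu)
        if cu[k] == delim           # ASCII delimiter: safe to compare raw bytes
            push!(fields, s[start:prevind(s, k)])
            start = k + 1           # the delimiter itself is one byte wide
        end
    end
    push!(fields, s[start:end])     # trailing field
    return fields
end

fields_by_bytes("ü,é,â,ỳ")          # ["ü", "é", "â", "ỳ"]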

3 Likes

No, I didn’t get that far, as I am just experimenting with the language. I imagine the final implementation would use regexp matching for parsing, as the text also contains tags. But I wanted to try a few things out first… like, hey, let’s just cut out this field and sum it up. My substring function works fine, I was just curious about why you have chop instead.

Yes, that’s the problem.

You were probably trying to test things out by coming up with your indices “visually” on some test strings, and noticed that the indices did not coincide with your visual expectation for Unicode strings. But this would be a problem even if Julia used vectors of codepoints, as my "ü" example showed.

For example, if I gave you the string "ü,é,â,ỳ" and asked you to cut out the 3rd and 4th comma-separated fields, analogous to cut -d, -f3-4, what “characters” (codepoints) do you think that corresponds to? Probably you would guess “the 5th to 7th characters”. But no, look what your substring function returns:

julia> substring("ü,é,â,ỳ", 5,7)
"́,a"

Whoops, is your substring function buggy? No, it’s just that “codepoints” in Unicode don’t necessarily correspond to what a human reader thinks of as a “character”.

Whereas if you actually implemented a cut function, you would do a sequence of searches for the delimiter (e.g. with findnext), which would yield a sequence of string indices (≠ codepoint counts), slicing would work just fine, and it would be absolutely irrelevant how many codepoints occurred between one delimiter and the next.
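
For instance, a sketch of such a cut-like function (cutfields is a hypothetical name; it assumes 1 ≤ m ≤ n and a single-character delimiter):

# extract delimiter-separated fields m through n by searching for delimiters with
# findnext and slicing with the resulting string indices; codepoint counts never appear
function cutfields(s::AbstractString, delim::AbstractChar, m::Integer, n::Integer)
    start = 1                                 # byte index where field m starts
    for _ in 1:m-1
        d = findnext(delim, s, start)
        d === nothing && return ""            # fewer than m fields
        start = nextind(s, d)
    end
    pos, d = start, nothing                   # pos: where to search for the next delimiter
    for _ in m:n
        d = findnext(delim, s, pos)
        d === nothing && return s[start:end]  # field n runs to the end of the string
        pos = nextind(s, d)
    end
    return s[start:prevind(s, d)]             # up to the delimiter that closes field n
end

cutfields("ü,é,â,ỳ", ',', 3, 4)               # "â,ỳ", regardless of how the string is normalized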

But because you hadn’t gotten that far, you jumped to the conclusion that Julia’s string handling is broken and we are missing extremely basic functionality like extracting substrings.

(The UTF-8 encoding that Julia employs is not unusual! It’s taking over most of the internet, it’s used in other modern languages like Go and Swift, and it’s been the subject of many, many discussions and revisions in Julia itself. This is not something we picked out of a hat because we hadn’t thought through basic functionality.)

8 Likes

Hey, thanks for the reply. I didn’t mean to criticise Julia, or suggest that Julia’s unicode implementation is broken.

1 Like
julia> SubString("äöü",2,3)
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'ä', [3]=>'ö'

Reading this discussion, I see the reasons. I also know about Unicode/UTF-8/… difficulties.
But…

I would prefer having just

julia> SubString("äöü",2,3)
"öü"

even if it would not be efficient.
(not discussing function substring or view-like SubString here, doesn’t matter for me)

I don’t think expecting a working substring functionality is that uncommon or even wrong, as in “you are using substring, so you are probably doing it wrong”.

In my opinion, the best reasoning doesn’t help if people validly expect it anyway. Expecting some kind of substring is, in my opinion, valid for any higher-level programming language (of course not for assembler).

Julia has a working substring functionality, which works on string indices.

The proposed character-indexing method does not eliminate the complexities of Unicode. Consider:

julia> substring(str, start, stop) = str[nextind(str, 0, start):nextind(str, 0, stop)]

julia> substring("äöü", 2,3) # NFC normalized string
"öü"

julia> substring("äöü", 2,3) # NFD normalized string
"̈o"

The supposed simplicity of “character indexing” is an illusion.

You pay a big price in performance to reduce apparent confusion on people’s first few days of using strings in Julia, but only postpone your Unicode bugs (because “characters” don’t mean what you think), and you get zero benefits in the long run (because indexing codepoint counts is not actually necessary for realistic string processing).

6 Likes

A significant difference from UTF-8 is that there is no such thing as malformed UTF-16, only unpaired surrogate code units, which still have code points, just not ones that correspond to valid Unicode characters. So every UTF-16 string can be iterated as code points, you might just get the code point for an unpaired surrogate if the string is invalid. Compare that with UTF-8 where some byte sequences just don’t follow the right structure at all.

(Another way to put this is that every UTF-16 sequence, valid or invalid, can be represented as a sequence of code points using WTF-8, which is an extension of UTF-8 allowing unpaired surrogate code points.)

4 Likes

We could certainly add a graphemes(str, m, n) method that returns a substring of the m-th to n-th graphemes, which is close to a user-perceived character slice.

But is this actual useful functionality in any real application? (When was the last time you used the graphemes(str) iterator, for that matter? At least grapheme iteration has some practical uses, e.g. cursor movement, but random grapheme access seems basically useful only for string demos.)

As far as I can tell, its sole practical utility would be to end discussions like this one. (Which might be worth a ≈25-line function in the Unicode stdlib, I suppose.)

https://github.com/JuliaLang/julia/pull/44266
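
A minimal sketch of what such a method could look like (not the implementation in the linked PR, just the idea, and far from optimal since it collects all graphemes):

using Unicode

# return a string spanning the m-th through n-th user-perceived characters (graphemes)
function grapheme_slice(s::AbstractString, m::Integer, n::Integer)
    gs = collect(graphemes(s))
    return join(gs[m:n])
end

grapheme_slice("ü,é,â,ỳ", 5, 7)   # "â,ỳ" whether the string is NFC- or NFD-normalized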

4 Likes

To Steve:

using Unicode
graphemes(('\U1F44D'*'\U0352')^100)

Isn’t grapheme slicing the way to generate an icon like this?

[image of the rendered icon]

No, you just need to call first on the graphemes iterator to get the “first letter”.

In contrast, general slicing means (simulated) random access. No one has so far given any practical use-case for this in which you don’t already have (or can easily get) string indices.
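
For example (illustrative):

using Unicode
first(graphemes(('\U1F44D' * '\U0352')^100))   # the first grapheme: the thumbs-up plus its combining mark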

5 Likes

But then, what is the real application of byte-indexing a substring? It looks pretty low-level to me—I understand you might need such indexing at the lowest level, manipulating bytes when doing the encoding/decoding, but I suppose this could also be done by a convert to Vector{UInt8} or something similar.

And sure, there is the NFC/NFD normalization choice that users of substring/graphemes should be aware of. In my real applications I sometimes use NFD, when I am interested in diacritics, and otherwise NFC. And sure, I have burnt myself working with un-normalized Unicode strings. Copying from your example, "äöü" != "äöü", which is also somewhat counterintuitive.

I can also imagine that modern programming paradigms discourage people from doing any kind of indexing of strings, e.g., by providing the most amazing set of string manipulation and iteration functions.