Previously, str[i:j] worked, if j was the index of the last codeunit in an encoded codepoint (UTF-8, in the case of String), however you need to do str[i:thisind(str,j)] to make it work, which, besides complicating a lot of code, can hurt performance, by making the code have to walk backwards (not even possible in many encodings!) to find first index of the character (and then, internally, it has to walk forwards again, to find the final codeunit that belongs in the slice!)
This is causing a bunch of breakage in code using String (another reason to stop using String, I suppose).
In what situations do you get an index to the last codeunit in a character? There have been lots of discussions around string indexing changes, and the idea behind this particular change was that you’re not supposed to compute indices manually: they are either returned by string functions, or obtained via nextind/prevind/thisind.
It’s very common. When processing a string in JSON, or splitting a string by \n, you frequently have the offset of the first codeunit of the start character, and the offset of the first codeunit of the next character.
Instead of simply doing SubString(str, i, j-1) or str[i:j-1], you need to do the slower (and not possible on most encodings) str[i:thisind(j-1)] or str[i:prevind(j)]).
I am surprised that a change that hurts performance, generality, and breaks a good amount of string processing code was merged.
One option would be to make s[i:prevind(j)] work and be fast.
Currently, prevind(j) is an error (you have to pass s), but it could instead construct a PrevInd(j) object. The range constructor could specialize on this type to make a “halfopen” range object that excludes the right endpoint. Finally, getindex would have a special method for “halfopen” ranges.
(In fact, this could be probably be made slightly faster than the 0.6 s[i:j-1] because it would only have to check isvalid(s,j). We could turn off the isvalid checks entirely for @inbounds code.)
(The error message for s[i:j] could suggest this when !isvalid(s,j) && isvalid(s,j+1), to help people discover the prevind(j) construction. Or we could use s[i:PrevInd(j)] to make it clearer that a new type is being constructed here.)
All of that seems to go very much against the original design for UTF-8 string indexing in Julia, i.e. that it was always indexing the code units, for performance reasons.
It also goes against the “new” philosophy of #24999, that a String does not have to be valid UTF-8.
I’m definitely not going to follow these breaking and performance degrading changes in Strs.jl.
A simpler check for validity that I will use is that either j is sizeof(str), or isvalid(str, j+1).
(To be compatible with the change for String on master, I suppose I can also accept isvalid(str, j) as well, and then skip forward to the last codeunit of the codepoint starting at j).
I think one of the reasons is that we can always turn errors into supported patterns in 1.x releases, while we cannot make changes in the reverse direction. So it’s less problematic to make indexing stricter in 1.0 and figure whether it’s a problem in practice.
Regarding the specific SubString example, I wonder whether performance has regressed for other reasons. Indeed on 0.6.2 the constructor called isvalid until it reached the first codeunit of the character anyway, which is equivalent to calling prevind first. On 0.7 we compute the number of codeunits instead, and store it in the object. Maybe it takes more time. I guess the only way to check is to allow indexing the last code unit, and see whether you can match the 0.6.2 performance.
I find @stevengj’s idea interesting, even though it’s more complex. Indeed, what annoys me with allowing indexing the last code unit in a character is that s[i-1] would be valid, efficient and recommended, but s[i+1] would be a mistake. I’m afraid this would really confuse users who are not familiar with Unicode. So providing another syntax which does not involve arbitrary arithmetic could be a good idea.
I am not saying that at all. This is only about SubString, and getindex with a range (which for some reason does a copy, even though the string is immutable - seems like it would be better to return a SubString in that case).
The starting index should always be valid, IMO. The ending index of a substring or slice should really be the index of the last code unit in that substring, NOT the first code unit of the last code point (except when the first code unit is the last, of course!)
I also didn’t say that the ending index should be allowed to be anything but the last codeunit of a character.
If you have a 4-byte codepoint, such as "\U1f596", then s should be valid, s[1:4] should be valid, but s, s, s should be invalid, and s[1:1], s[1:2] and s[1:3] should also be invalid.
The slicing operation shouldn’t really have to know about the underlying encoding - it should simply operate on the code units themselves.
For the Str types that are guaranteed to be valid (such as UniStr,UTF8Str, or UTF16Str) you’d get an error, but not for the ones that don’t make that guarantee (Text1Str, for example) you wouldn’t.
Since String no longer even purports to contain valid UTF-8, why should it even bother to give an error for any of those slicing operations?
Yes, indexing strings by UnitRanges is already strange – they’re not selecting all those consecutive indices but instead just representing the end points. The array equivalent would be a ClosedInterval from IntervalSets.
What you are doing is selecting all the code units in the range. I had considered the idea that we could allow arbitrary (inbounds) i and j in s[i:j] and just do the slicing by code units. That’s a bit dangerous, but it is a consistent mental model. Overall, I don’t think it’s the best idea, however, since it would imply that you always need to give j as the index of the last code unit of the last character, which is often not convenient. Worse still, this code will look like it’s working until the last character happens to be a multi-code-unit character, at which point you’ll get mysterious trailing junk. So scratch that idea.
One way to spell the operation in question is this:
String(codeunits("∀ x ∃ y")[1:9])
Of course, the issue then is to make sure that’s efficient enough. I haven’t measured so I’m not sure if it is or isn’t, but it’s the “right” way to do this currently.
This currently makes an extra copy of the data, but that would be fixed by adding a CodeUnits slicing method that allocates the copy using StringVector (so that String(...) will take ownership of the new array).
This doesn’t help with the SubString case, though.
allow alternative slicing of code units and make it efficient. One can imagine two options: a) without making a copy of data and b) with making copy of data.
one has to remember that SubString on invalid slice will not work correctly as @stevengj mentions, as many underlying functions fall back to string wrapped by SubString and would produce wrong results.
Why is that difficult? If you don’t already have the last code unit, simply use nextind(str, j) - 1, which should always work (and in the case where j is a valid index to the start of a code point, will work with any encoding (such as GB 18030 - official character set encoding for almost 1.4 billion people!), not just one like UTF-8, which is self-synchronizing.
Having to use prevind or thisind, which may not be possible generically, doesn’t seem like a good approach.
Why would that happen? If you don’t know for sure that you have the index of the last code unit (or have the index of the first code unit of the next code point so you can simply subtract 1), you can simply call nextind.
I think making it consistent with slicing the vector of bytes makes a lot more sense - i.e.
The behavior has not changed, only the set of inputs which are considered valid. You seem to be arguing for a different breaking change which is to require that j be the index of the last code unit of the last character. That has never been the way string slices have worked so saying that this is changing back is not accurate.
You forgot, that I also stated that to retain compatibility, it would also be possible to accept the first code unit also.
Note: that shouldn’t really affect performance - because it would only check to see if it needs to find the final index in a case where it would otherwise be invalid in the new scheme.