Argh… makes perfect sense! Thanks for the clarification.
The sizeof function is a bit of a misrecommendation here. It gives the size in bytes of the string representation. However, what you generally want is the number of code units, not the number of bytes, which is given by ncodeunits(s). In the case of String the code unit is a byte so these coincide, but indexing in abstract strings in general is in terms of code units, not bytes. Since ncodeunits is undefined for characters, the above confusion would not occur: only ncodeunits(s[1:1]) makes sense and ncodeunits(s[1]) is an error. I need to rewrite the string docs to explain the general string model but there has been more pressing work until now.
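To make the distinction concrete, here is a small sketch, assuming the built-in String type (where a code unit is one byte) and an example string chosen purely for illustration. As far as I know, Julia 1.1 and later also define ncodeunits for a Char, so the sketch does not rely on that call throwing an error.

```julia
s = "αβ"               # two characters, each encoded as 2 bytes in UTF-8

ncodeunits(s)          # number of code units in the string
sizeof(s)              # size in bytes; coincides with ncodeunits(s) for String
ncodeunits(s[1:1])     # code units in the one-character substring "α"

sizeof(s[1])           # size of the Char value itself, not of its UTF-8 encoding
```

The last line is where the confusion comes from: s[1] is a Char, and sizeof reports the size of that value in memory rather than how many bytes the character occupies inside the string.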
lastindex() wouldn’t be able to work on most variable-width encodings, because it depends on being able to move in the reverse direction. The only encodings I know of that allow this are UTF-8 and UTF-16, and I’m not even sure that going backward will give the same set of characters as going forward if there are invalid sequences, given that String no longer makes any guarantees that the string is valid UTF-8.
A number of string handling functions in Julia, such as creating SubStrings, depend on lastindex, prevind and thisind working, which means that those functions are not useful for handling arbitrary variable-width encodings.
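For reference, a minimal sketch of what these index functions do on a UTF-8 String. The example string and variable names are only illustrative.

```julia
s = "aα∀"                              # 1-byte, 2-byte, and 3-byte characters

last_idx  = lastindex(s)               # index of the first byte of the final character
prev_idx  = prevind(s, last_idx)       # steps backward to the start of "α"
start_idx = thisind(s, ncodeunits(s))  # maps a trailing byte back to its character start

# Both endpoints have to be valid character indices.
sub = SubString(s, prev_idx, last_idx)
```

All three calls need to look at earlier bytes to find where a character begins, which is exactly the capability an encoding that can only be decoded forward does not offer.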
There are also performance issues with having to read 1-4 (or 5 or 6) bytes going backwards to find the first byte of a valid sequence (it’s a lot more complicated if you also have to deal with possibly invalid sequences).
Unfortunately, SubString doesn’t even have a constructor that allows you to bypass that.
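To show where the backward-scanning cost comes from, here is a rough sketch of finding the first byte of a character by walking backwards. The helper name lead_byte_index is hypothetical, and the sketch assumes the string holds valid UTF-8. It is not how Base implements thisind, which also has to cope with invalid sequences.

```julia
# Hypothetical helper: assumes `s` contains only valid UTF-8.
# UTF-8 continuation bytes have the bit pattern 10xxxxxx.
function lead_byte_index(s::String, i::Integer)
    while i > 1 && (codeunit(s, i) & 0xc0) == 0x80
        i -= 1          # still inside a character, keep stepping back
    end
    return i
end

s = "aα∀"
lead_byte_index(s, ncodeunits(s))   # start of the final, 3-byte character
```

Each step back is a dependent load and a branch, and once invalid data is possible the loop also has to check that the lead byte it finds actually claims that many continuation bytes.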
As far as I remember, it was thoroughly checked that you will. To make sure, I have just written a randomized test that confirms that everything works correctly.
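function rs()
    # Random bytes: almost certainly not valid UTF-8.
    x = rand(UInt8, 10^8)
    s = String(x)
    i = 2
    curidx = 1                  # start of the character currently being scanned
    nextidx = nextind(s, 1)     # start of the following character
    while i <= ncodeunits(s)
        if i < nextidx
            # i is inside the current character: backward lookups must agree.
            thisind(s, i) == curidx || @error i
            if i > curidx
                prevind(s, i) == curidx || @error i
            end
        else
            # i starts a new character as seen by forward iteration.
            isvalid(s, i) || @error i
            thisind(s, i) == i || @error i
            curidx = i
            nextidx = nextind(s, i)
        end
        i += 1
    end
end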
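A more compact way to state the same property is sketched below: the indices reached by walking backward with prevind should match the ones forward iteration produces. The function name same_indices_both_ways and the sample byte sequence are made up for illustration.

```julia
# Hypothetical check: compare forward iteration with a backward walk.
function same_indices_both_ways(s::String)
    forward = collect(eachindex(s))
    backward = Int[]
    i = lastindex(s)
    while i >= 1
        push!(backward, i)
        i = prevind(s, i)
    end
    return forward == reverse(backward)
end

# An invalid lead byte, a stray continuation byte, 'a', and a truncated 3-byte sequence.
bad = String(UInt8[0xff, 0x80, 0x61, 0xe2, 0x88])

same_indices_both_ways("aα∀")
same_indices_both_ways(bad)
same_indices_both_ways(String(rand(UInt8, 10^4)))
```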
That’s good to know.
I think there are still some inconsistencies, because of different ways of handling UTF-8 sequences that encode numbers from 0x110000 to 0x7fffffff. Those were allowed for the first 10 years of the UTF-8 spec, before they decided to limit things to be consistent with the range of UTF-16, i.e. 0 to 0x10ffff. Many programs, including the internals of both PCRE and Julia itself (in the utf8.c that Jeff wrote in 2005), treat those 5-6 byte sequences as a single character, unlike what String currently does.
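A quick way to see how String handles such a sequence is sketched here. The byte values are just one example of a 5-byte sequence under the original UTF-8 definition, and the behavior may differ between Julia versions, so treat it as something to try rather than a specification.

```julia
# 0xf8 begins a 5-byte sequence under the original (pre-2003) UTF-8 definition.
s = String(UInt8[0xf8, 0x88, 0x80, 0x80, 0x80])

isvalid(s)               # does String consider this valid UTF-8?
length(s)                # how many characters does String report?
collect(eachindex(s))    # where does String think each character starts?
```

A library that still accepts the long forms would see one character here, so character offsets computed on one side will not line up with those on the other.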