Performance of length(::String)

Argh… makes perfect sense! Thanks for the clarification.

The sizeof function is a bit of a misrecommendation here. It gives the size in bytes of the string representation. What you generally want, however, is the number of code units, not the number of bytes, and that is given by ncodeunits(s). For String the code unit is a byte, so the two coincide, but indexing into abstract strings is in general defined in terms of code units, not bytes. Since ncodeunits is undefined for characters, the above confusion would not occur: only ncodeunits(s[1:1]) makes sense, and ncodeunits(s[1]) is an error. I need to rewrite the string docs to explain the general string model, but there has been more pressing work until now.
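To make the distinction concrete, here is a minimal sketch with a string I picked for illustration (it is not from the thread). It only shows the calls discussed above. I believe later Julia releases added an ncodeunits method for Char, so the sketch does not depend on ncodeunits(s[1]) failing.

```julia
s = "αβγ"              # illustrative string: three characters, two bytes each in UTF-8

sizeof(s)              # 6, bytes in the representation
ncodeunits(s)          # 6, code units (a code unit is a byte for String)
codeunit(s)            # UInt8, the code unit type of String
length(s)              # 3, characters

ncodeunits(s[1:1])     # 2, s[1:1] is a one-character String
sizeof(s[1])           # 4, s[1] is a Char, which is stored in 32 bits
```

The last two lines are where the confusion comes from: sizeof on a Char reports the size of the Char type, not the number of bytes the character occupies inside the string.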


lastindex() wouldn’t be able to work on most variable-width encodings, because it depends on being able to move in the reverse direction. The only encodings I know of that allow this are UTF-8 and UTF-16, and even then I’m not sure that scanning backward yields the same set of characters as scanning forward when there are invalid sequences (String no longer makes any guarantee that the string is valid UTF-8).

A number of string handling functions in Julia depend on lastindex, prevind and thisind working, such as creating SubStrings, which means that they are not useful for handling arbitrary variable-width encodings.
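For readers who have not used these functions, a small sketch of what they return on a String. The sample string is my own choice for illustration.

```julia
s = "αβγ"                   # characters start at code units 1, 3 and 5

lastindex(s)                # 5, start of the final character
thisind(s, 4)               # 3, start of the character that covers code unit 4
prevind(s, 5)               # 3, start of the character before index 5

# SubString takes the start indices of its first and last characters,
# so it has to locate where the last character ends.
sub = SubString(s, 1, 3)    # "αβ"
ncodeunits(sub)             # 4
```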

There are also performance issues with having to read 1-4 (or 5 or 6) bytes going backwards to find the first byte of a sequence. That is the cost for valid sequences; it’s a lot more complicated if you also have to deal with possibly invalid ones.
Unfortunately, SubString doesn’t even have a constructor that allows you to bypass that.
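As a rough illustration of the backward scan described above, here is a sketch that assumes the input is valid UTF-8 with sequences of at most 4 bytes. The helpers is_continuation and char_start are hypothetical names of mine, not Base functions, and this is not how Base implements thisind, which also has to cope with invalid data.

```julia
# Hypothetical helper: UTF-8 continuation bytes have the bit pattern 10xxxxxx.
is_continuation(b::UInt8) = (b & 0xc0) == 0x80

# Hypothetical helper: find the start of the character covering code unit i.
# Assumes valid UTF-8, so at most 3 continuation bytes are skipped.
function char_start(s::String, i::Int)
    lo = max(1, i - 3)
    j = i
    while j > lo && is_continuation(codeunit(s, j))
        j -= 1              # one extra byte read per step backward
    end
    return j
end

t = "aα€😀"                 # characters of 1, 2, 3 and 4 bytes
agrees = all(char_start(t, i) == thisind(t, i) for i in 1:ncodeunits(t))
```

On valid input the sketch agrees with thisind at every code unit. The extra byte reads in the loop are the cost being discussed, and handling malformed sequences would add more branches on top.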

As far as I remember, it was thoroughly checked that you will. To make sure, I have just written a randomized test that confirms that it all works correctly.

function rs()
    x = rand(UInt8, 10^8)
    s = String(x)
    i = 2
    curidx = 1
    nextidx = nextind(s, 1)
    while i <= ncodeunits(s)
        if i < nextidx
            thisind(s, i) == curidx || @error i
            if i > curidx
                prevind(s, i) == curidx || @error i
            end
        else
            isvalid(s, i) || @error i
            thisind(s, i) == i || @error i
            curidx = i
            nextidx = nextind(s, i)
        end
        i += 1
    end
end
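A smaller, deterministic check in the same spirit: walk a malformed string backward with prevind and compare against forward iteration. The byte sequence and the helper name backward_indices are my own choices for illustration, not part of the test above.

```julia
# Hypothetical helper: collect character start indices by walking backward.
function backward_indices(s::AbstractString)
    idxs = Int[]
    i = lastindex(s)
    while i >= firstindex(s)
        push!(idxs, i)
        i = prevind(s, i)   # returns 0 once we step before the first character
    end
    return reverse!(idxs)
end

# Hand-picked malformed input: a stray 0xff, a truncated 3-byte sequence,
# and a lone continuation byte at the end.
bad = String(UInt8[0x61, 0xff, 0xe2, 0x82, 0x62, 0x80])

forward = collect(eachindex(bad))
backward = backward_indices(bad)
forward == backward         # both directions should find the same boundaries
```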

That’s good to know.
I think there are still some inconsistencies, because of different ways of handling UTF-8 sequences that encode numbers from 0x110000 to 0x7fffffff. Those were allowed for the first 10 years of the UTF-8 spec, before they decided to limit things to be consistent with the range of UTF-16, i.e. 0 to 0x10ffff. Many programs, including the internals of both PCRE and Julia itself (in the utf8.c that Jeff wrote in 2005), treat those 5-6 byte sequences as a single character, unlike what String currently does.
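To see the difference on a concrete input, here is a sketch using a 5-byte sequence in the legacy form. The bytes are a value I constructed for illustration, and decode_legacy5 is a hypothetical helper showing how a decoder that accepts the old form would read them. It is not taken from PCRE or utf8.c.

```julia
# Legacy 5-byte form: 111110xx followed by four 10xxxxxx bytes.
# These bytes encode 0x200000, which is above the current 0x10ffff limit.
bytes = UInt8[0xf8, 0x88, 0x80, 0x80, 0x80]

# Hypothetical helper: decode one legacy 5-byte sequence into a number.
function decode_legacy5(b::Vector{UInt8})
    cp = UInt32(b[1] & 0x03)                 # two payload bits in the lead byte
    for k in 2:5
        cp = (cp << 6) | UInt32(b[k] & 0x3f) # six payload bits per continuation byte
    end
    return cp
end

decode_legacy5(bytes)       # 0x00200000, a single value for the old-style decoder

legacy = String(copy(bytes))
isvalid(legacy)             # false, String sees malformed data
ncodeunits(legacy)          # 5
length(legacy)              # more than 1, not a single character
```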