Performance of length(::String)

tk3369 · July 28, 2018, 5:14pm

Argh… makes perfect sense! Thanks for the clarification.

StefanKarpinski · July 28, 2018, 6:22pm

The sizeof function is a bit of a misrecommendation here. It gives the size in bytes of the string representation. However, what you generally want is the number of code units, given by ncodeunits(s), not the number of bytes. In the case of String the code unit is a byte, so these coincide, but indexing into abstract strings in general is in terms of code units, not bytes. Since ncodeunits is undefined for characters, the above confusion would not occur: only ncodeunits(s[1:1]) makes sense and ncodeunits(s[1]) is an error. I need to rewrite the string docs to explain the general string model, but there has been more pressing work until now.
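A minimal sketch of the distinction, assuming a String of two-byte characters (the sample string is just an illustration). The Char method of ncodeunits is deliberately left out, since its behavior may differ between Julia versions:

```julia
s = "αβγ"                  # three characters, two bytes each in UTF-8

n_chars = length(s)        # number of characters
n_units = ncodeunits(s)    # number of code units (bytes for String)
n_bytes = sizeof(s)        # size in bytes; coincides with ncodeunits for String

# A one-character substring is still a string, so ncodeunits is meaningful.
first_units = ncodeunits(s[1:1])

# s[1] is a Char, and sizeof reports the size of the Char type itself,
# not the number of bytes that character occupies inside the string.
char_size = sizeof(s[1])

println((n_chars, n_units, n_bytes, first_units, char_size))
```

The last value is where the confusion comes from: sizeof(s[1]) describes the Char value in memory, not its encoding within s.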

ScottPJones · July 28, 2018, 8:46pm

lastindex() wouldn’t be able to work on most variable-width encodings, because it depends on being able to move in the reverse direction. The only encodings I know of that can are UTF-8 and UTF-16, and I’m not even sure that going backwards will yield the same set of characters as going forwards if there are invalid sequences (String no longer makes any guarantees that the string is valid UTF-8).

A number of string handling functions in Julia depend on lastindex, prevind and thisind working, such as creating SubStrings, which means that they are not useful for handling arbitrary variable-width encodings.

There are also performance issues with having to read 1-4 (or 5 or 6) bytes going backwards to find the first byte of a valid sequence (it’s a lot more complicated if you also have to deal with possibly invalid sequences).
Unfortunately, SubString doesn’t even have a constructor that allows you to bypass that.
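To make the backward scan concrete, here is a rough sketch. The helper start_of_char is hypothetical and only handles valid UTF-8; it is not how Base implements thisind, just an illustration of skipping continuation bytes (those of the form 0b10xxxxxx) until a lead byte is reached:

```julia
# Hypothetical helper: walk backwards from byte i to the first byte of the
# character containing it. Assumes s is valid UTF-8.
function start_of_char(s::String, i::Int)
    while i > 1 && (codeunit(s, i) & 0xc0) == 0x80
        i -= 1
    end
    return i
end

s = "a€b"   # 'a' is 1 byte, '€' is 3 bytes (indices 2 to 4), 'b' is 1 byte

println(start_of_char(s, 4))   # start of the character containing byte 4
println(thisind(s, 4))         # Base's answer for the same question
println(prevind(s, 5))         # character before 'b'
println(lastindex("a€"))       # last valid index, found by scanning back from byte 4
```

For valid input the sketch agrees with thisind; handling invalid sequences takes additional checks, which is the complication mentioned above.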

bkamins · July 28, 2018, 9:28pm

As far as I remember, it was thoroughly checked that you will get the same set of characters in both directions. To make sure, I have just written a randomized test that confirms that it all works correctly.

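function rs()
    # 10^8 random bytes, so the string is almost certainly invalid UTF-8
    x = rand(UInt8, 10^8)
    s = String(x)
    i = 2
    curidx = 1                 # start of the current character
    nextidx = nextind(s, 1)    # start of the following character
    while i <= ncodeunits(s)
        if i < nextidx
            # byte i lies inside the current character
            thisind(s, i) == curidx || @error i
            if i > curidx
                prevind(s, i) == curidx || @error i
            end
        else
            # byte i starts a new character
            isvalid(s, i) || @error i
            thisind(s, i) == i || @error i
            curidx = i
            nextidx = nextind(s, i)
        end
        i += 1
    end
end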
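A smaller companion check, as a sketch only: the helpers forward_indices and backward_indices are hypothetical names, and the input size is an arbitrary choice. It collects character start indices walking forward with nextind and backward with prevind over random (mostly invalid) bytes and compares the two:

```julia
# Hypothetical helper: character start indices found by walking forward.
function forward_indices(s::String)
    idx = Int[]
    i = firstindex(s)
    while i <= ncodeunits(s)
        push!(idx, i)
        i = nextind(s, i)
    end
    return idx
end

# Hypothetical helper: character start indices found by walking backward.
function backward_indices(s::String)
    idx = Int[]
    i = lastindex(s)
    while i >= 1
        push!(idx, i)
        i = prevind(s, i)
    end
    return reverse!(idx)
end

s = String(rand(UInt8, 10_000))   # random bytes, very likely invalid UTF-8
same = forward_indices(s) == backward_indices(s)
println(same)
```

If the two directions ever disagreed on where characters start, the comparison would be false.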

ScottPJones · July 28, 2018, 9:49pm

That’s good to know.
I think there are still some inconsistencies, because of different ways of handling UTF-8 sequences that encode numbers from 0x110000 to 0x7fffffff. Those were allowed for the first 10 years of the UTF-8 spec, before they decided to limit things to be consistent with the range of UTF-16, i.e. 0 - 0x10ffff. Many programs, including the internals of both PCRE and Julia itself (in the utf8.c that Jeff wrote in 2005), treat those 5-6 byte sequences as a single character, unlike what String currently does.
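As an illustration, here is a sketch using the bytes 0xf8 0x88 0x80 0x80 0x80, which under the older 5-byte scheme would encode 0x200000. The choice of value is arbitrary, and the exact way String splits the sequence is printed rather than assumed:

```julia
# Five bytes that the original UTF-8 scheme would read as one code point
# (0x200000); the current definition stops at 0x10ffff.
bytes = UInt8[0xf8, 0x88, 0x80, 0x80, 0x80]
s = String(copy(bytes))

valid = isvalid(s)     # whether the whole string is valid UTF-8
nchars = length(s)     # how many characters String sees in these bytes

println(valid)
println(nchars)
println(codeunits(s) == bytes)   # the raw bytes are preserved either way
```

A decoder following the older rules would report a single character here, which is the inconsistency described above.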
