Breakage due to changes in `String` slicing in v0.7

I think the inconsistency, loss of performance, as well as breakage of currently working code would be enough reason to change this, and the ‘codeunits’ approach is rather hard to explain.

j-1? By j do you mean the index of the next codeunit?
If so, the validation is trivial, either j-1 must be == sizeof(str), or j must be the first (or only) codeunit of a valid codepoint. Faster than what is currently being done. Just benchmark it, as I’ve done, and you’ll see.

No more so than having the rule that the end index of a SubString or range index must be the last codeunit of a valid codepoint. That would also have caught those incorrect uses of indexing that pointed into the middle of a codepoint.

Consider a string s="∀"^100 - how do you efficiently validate if s[22:54] is a correct range under your rule and why it is more efficient?

EDIT: just to explain what I mean: under current rule it is cheap to check if a codeunit is a start of a valid code point because you do not have to a perform traceback (most of the time).

EDIT 2: please note that I am not objecting against allowing selection of different slices from a string - I simply think we need a separate function for this.

On v0.6.2:

julia> f(s,i) = s[22:i]
f (generic function with 1 method)

julia> @btime f($s,54)
  17.665 ns (1 allocation: 64 bytes)
"∀∀∀∀∀∀∀∀∀∀∀"

On v0.7:

julia> f(s,i) = s[22:thisind(s,i)]
f (generic function with 2 methods)
julia> g(s,i) = s[22:i]
g (generic function with 1 method)
julia> @btime g($s,52)
  44.558 ns (1 allocation: 64 bytes)
"∀∀∀∀∀∀∀∀∀∀∀"
julia> @btime f($s,54)
  49.454 ns (1 allocation: 64 bytes)
"∀∀∀∀∀∀∀∀∀∀∀"

Does that help answer your question?

That (most of the time) is rather important. If you want to be able to have a generic rule, that works for any character set encoding, you can’t have things depending on moving backwards.

Under the rules I outlined, it is trivial. The first index of the range must be the within the bounds of the string (i.e. 1 <= index <= sizeof(str), and the code unit indexed must be the start of a code point.
The second index of the range must be in bounds also, and index the last code unit of a code point (i.e. it is the last code unit in the string, or the next index is a valid start). For UTF-8, those cases are very easy to determine (i.e. using !is_valid_continuation(ch)).

What do you mean here?

I’ve been playing with this in 0.7, and it seems a bit difficult to make the codeunits approach fast because of the cost of first allocating a Vector{UInt8} object and then a String, even if StringVector is used so that no additional data copies are made.

Some timings on my machine with s = "αβ"^10 and s[1:6]:

  • Julia 0.6 s[1:6]: 20ns
  • Julia 0.7 s[1:thisind(s,6)]: 39ns
  • Julia 0.7 String(codeunits(s)[1:6]): 54ns
  • Julia 0.7 patched so that codeunits uses a StringVector and memcpy: 40ns
  • Julia 0.7 String(cuslice(s, 1:6)) (see cuslice below) that avoids creating the CodeUnits object: 38ns
  • Julia 0.7 cuslice with bounds check disabled: 35ns
  • Julia 0.7 fastslice(s,i) = unsafe_string(pointer(s,first(i)), length(i)): 16ns

(For a longer string the extra overhead wouldn’t matter so much, of course.)

cuslice function mentioned above:

function cuslice(s::String, i::UnitRange{Int})
       @boundscheck checkbounds(s, i)
       n = length(i)
       a = Base.StringVector(n)
       t = Base.@_gc_preserve_begin s
       ccall(:memcpy, Ptr{Cvoid}, (Ptr{UInt8}, Ptr{UInt8}, UInt),
             a, pointer(s, first(i)), n)
       Base.@_gc_preserve_end t
       return a
end

My i:PrevInd(j) suggestion should achieve the optimal speed with appropriate methods, but it does involve a new range type etc. It also works with SubString, which the codeunits approach does not.

It does - g is faster than f as I have said. The thing that 0.6.2 is faster than 0.7 is a separate issue.

is_valid_continuation is not sufficient to determine if it is a last code unit under definition of String in 0.7.

With EDIT 2 I mean that I would leave s[i:j] as is, but having an additional function that does selection of code units form String as @stevengj proposes is a good idea.

I think you misunderstood. You use !is_valid_continuation(ch) to see if the next code unit after is a valid start character (i.e. not in the range 0x80-0xbf), and that, only if the second index is not == sizeof(str).

That function doesn’t make sense to me. length(i) I think is a typo.
Also, that function can’t get you the slice s[1:6]

Did you mean to have fastslice(s, i) = i <= sizeof(s) ? unsafe_string(pointer(s), i) : throw(BoundsError, s, i)

It does not make a difference unfortunately

s = "\x88\x80"

is an example of such a situation under 0.7, where s[1] and s[2] are both valid.

To close the discussion from my perspective (as an opinion to whoever would make a PR to base - I have nothing more to add):

  • I am supporting adding a different way to efficiently extract range of code units from a String as a new String (by making a copy or making a view of a subrange - whichever is more efficient or both);
  • current behavior of s[i:j] is in my opinion simplest to understand by non-experts and I would leave it as is;
  • we have to acknowledge that implementation of String will not be most efficient for all use cases - efforts to implement alternative AbstractStrings in packages are in my opinion very welcome;
  • for the above to work the crucial thing is to make sure that we have a precise documentation of AbstractString interface and it would be nice if we had AbstractChar in Base along with documentation of its interface.
2 Likes

length(i) correct. i here is a UnitRange like 1:6. The typo was that I forgot to write the first call. It should have been fastslice(s,i) = unsafe_string(pointer(s,first(i)), length(i)) (now it is fixed).

Also, that function can’t get you the slice s[1:6]

Do fastslice(s, 1:6).

I intentionally omitted bounds checking here, analogous to @inbounds code if the bounds-check is protected by @boundscheck, in order to get as close as possible to the lower bound on the time. Add a @boundscheck checkbounds(s, i) line for a safe version.

Did you mean to have fastslice(s, i) = i <= sizeof(s) ? unsafe_string(pointer(s), i) : throw(BoundsError, s, i)

No. That would not give an arbitrary slice — it would only work for the beginning of the string.

They are not valid, if you make are viewing the String as being UTF-8 encoded, which getindex does.

OK - we’ll have to agree to disagree - it seems very inconsistent to me, harder to use, and harder to understand than the alternative.

The problem is, is that it’s simply less efficient in basically all cases as far as I have been able to determine.

Yes, I’ve been asking for that for two months now.

OK - that’s a case where both using a variable name like rng (or at least r) instead of i, and a type declaration ::AbstractRange would have helped a lot in making that code understandable.

Does garbage collection not have a problem with one string pointing into the middle of another like that?
If so, why even bother timing the fastslice function, if it’s not going to be useful at all in practice?
(it doesn’t seem to really be faster than making a SubString by the older rules though).

I was just giving the minimal version for benchmarking. The “safe” version (which is ~just as fast in @inbounds code) is:

function fastslice(s::String, i::UnitRange)
    @boundscheck checkbounds(s, i)
    GC.@preserve s unsafe_string(pointer(s,first(i)), length(i))
end

The question, therefore, is not whether slicing a range of code units can be fast with the existing 0.7 codebase. The question is what API to export for this. Options:

  • Make s[i:j] slice a range i:j of codeunits. This is error-prone for the reasons @StefanKarpinski described.
  • Make s[i:j] slice a range of codepoints if i and j are valid indices, and a range of codeunits otherwise. This also seems error-prone to me: it might silently give unexpected results.
  • The 0.6 behavior (IIRC): Make s[i:j] equivalent to s[thisind(s,i):thisind(s,j)]. A bit odd since s[i] otherwise requires a valid index i.
  • Add a special range type for codeunit slicing, e.g. my suggestion of s[i:PrevInd(j)] covers what seems to be the most important use-case.
  • Make String(codeunits(s)[i:j])) fast. This seems like it would require deeper changes to the compiler to avoid the intermediate StringVector allocation, as I noted above, and doesn’t address the SubString case.
  • Add a separate function codeunitslice(s, i:j) or similar. A bit awkward, and we would also need a function for SubString.

I don’t really see how that is any more or less error prone than defining it the way it has been done in master.
i must be a valid index, and j must be == sizeof(s) or j+1 must be a valid index.