StringIndex idea (Julia 2.0)

Continuing the discussion from String indices : byte indexing feels wrong… The other day I had a somewhat clever idea for how to preserve the O(1) indexing benefits that Julia currently has, while making string indexing more intuitive. The idea is to define a StringIndex type like this:

struct StringIndex
    unit_index  :: Int # code unit index
    char_offset :: Int # character offset from there
end

The behavior of strings and operations is changed so that:

  • When indexing into a string with a plain integer, indexing is done by character even though this is O(n). Any place in real code where a plain integer would be used for this, it’s probably counting from the start of the string, and you’d end up doing an O(n) operation anyway.

  • Operations that produce indices into strings return StringIndex(unit, 0) objects instead of plain integers. If you add or subtract a plain integer value to a string index, it simply adjusts the char offset field and leaves the unit field unchanged.

  • When you index into a string with a StringIndex(unit, char) object, it goes to the unit index in O(1) time and then scans forward or backwards by char characters, which is potentially O(n) but in practice the offset will be small and you’d have done that many character scanning operations anyway.

The only down sides I can see to this scheme are:

  1. When someone looks at a StringIndex object it’s kind of confusing—what is this thing and why does it have two components?
  2. If take a StringIndex object derived from one string and index into another one with it, you will get garbage.

Both of these could be fixed if StringIndex captured the string it was derived from as well, but I worry about putting a reference to a typically heap allocated string object in something as simple as an index, but maybe our compiler is smart enough to mostly delete that these days in cases we care about. I.e. the definition would be like this instead:

struct StringIndex{S<:AbstractString}
    string      :: S   # source of the index object
    unit_index  :: Int # code unit index
    char_offset :: Int # character offset from there
end

It’s weird for an index value to keep a reference to the thing it indexes though.

7 Likes

Reminds me of the old julia#9297. @nalimilan: was performance the major limiting factor there? Sure seems like we should do much better on that front these days.

1 Like

See also Restrict indexing into strings to a special `ByteIndex` or `StringIndex` type · Issue #9297 · JuliaLang/julia · GitHub

It doesn’t seem ideal to have every string index take up twice as much space (and have the resulting getindex(::String, ::StringIndex) be more expensive to boot. Even worse if you stick a reference to the String in there.

I seem to recall that this was a sticking point of #9297 as well — the ostensible simplicity of a specialized string-index type looks less attractive when you realize that the index must be implicitly or explicitly tied to a particular string.

1 Like

IIRC the limiting factor for that version of StringIndex without capturing the string object was ergonomics. The limiting factor for the version that did capture the string was performance, and I’m still reluctant to do something that relies too heavily on fancy compiler optimizations to be tolerably fast. If the byte index is the only field of StringIndex then what does StringIndex(code_unit) + 1 do? You can’t make it work since you need the string to compute a new code unit index. The solution I had previously thought of was to capture the string and then make StringIndex(str, code_unit) + k fast forward the code unit index by doing nextind(str, code_unit, k). The innovation here is that you don’t need to capture the string, you can just keep the character offset as a separate field from the code unit index and do the character advancing when you index. So you can avoid capturing the string object that way.

It creates yet another type — an OffsetStringIndex — that holds both fields for you.

In fact, we could introduce a handful of types here — a ByteIndex, a CharIndex, and a GraphemeIndex. And then an OffsetIndex could be parameterized by the offset type. Indexing by the Char- or GraphemeIndex would of course be O(N), but it’s quite explicit what you’re doing.

Just a spitball.

But currently we have a plain integer index that’s implicitly tied to a particular string, so I’m not sure that’s any better.

I agree that sticking the string reference in there feels bad. But I’m not sure that having every string index take up more space is actually a big problem. Storing string indices seems like a very uncommon thing—they’re mostly objects that only live for a little bit inside of a function and since they’re immutable and simple, the compiler should be very good at optimizing them away entirely where possible.

Regarding getindex(::String, ::StringIndex) being less efficient, that may be true, but if you look at the whole code I think it would be no less efficient. Here’s a fairly typical string function I found be grepping Julia for nextind usages:

function splitenv(e::String)
    i = findnext('=', e, 2)
    if i === nothing
        throw(ArgumentError("malformed environment entry"))
    end
    e[1:prevind(e, i)], e[nextind(e, i):end]
end

Instead of just an integer, i would be a StringIndex instead, and the last line could become e[1:i-1], e[i+1:end] which is exactly what everyone wants to write. This requires adding appropriate methods colon(::Integer, ::StringIndex) and such but it’s all doable. Does it get less efficient? I haven’t implemented any of this so I can’t prove it, but the extra work that getindex does for those ranges is precisely the work that is currently explicitly done by calling prevind and nextind.

Every single SubString object uses string indices internally, and you often store them and pass them between functions. But I guess they could use a lower-level internal API.

I worry about creating a zoo of nestable types like this. I’m not sure we need this for anything besides the code unit + char offset combo. Note that my proposed StringIndex isn’t even parametric, so it’s very easy on the compiler. One string index type for all strings. But having two types—one for a non-offset code unit index and another for a code unit index + char offset—could work. However, I think just representing that with a zero character offset would be simpler.

1 Like

Right, SubStrings could just continue to store the code unit index as a plain integer.

I like it. You could do all of this in a non-breaking way in a package, if you are ok with creating new functions analogous to Base (e.g. sfindnext) to return StringIndexes.

A potential downside is that the actual byte offset is computed lazily so if people overuse the char_offset it could be a performance issue. But then everything is a footgun if you use it wrong enough and point it at your feet.

A function to normalise a StringIndex to one with char_offset==0 (given the string) would be good.

1 Like

AFAICT I abandoned the idea of a StringIndex storing a codeunit index plus an offset in chars (as proposed here) for performance reasons (see this comment). Sorting a pointer was fast but raised problems. But the compiler may have improved since then.

4 Likes

It is somewhat appealing to have have the core StringIndex type not have any offset field does address both concerns about storage becoming bigger and getindex becoming less efficient: in the case where you haven’t done any additional seeking, it’s just a wrapped integer; in the case where there’s an offset you had to do that seeking work anyway.

But on the other hand, we might want to parameterize StringIndex{kind} where kind is one of :char, :grapheme or :grapheme_cluster and indicates what the meaning of the offset is. Why? So that you can have chars(str) produce StringIndex{:char}(index, 0) indices and graphemes(str) produce StringIndex{:grapheme}(index, 0) indices. Then when you do idx + 1 what the 1 increments depends on what kind of a view into the string object you’re working with. Equivalently we could just call those CharIndex and GraphemeIndex and GraphemeClusterIndex, but the point is that you want to capture in the type what kind of indexing you’re doing so that advancing from there can do the right thing.

Here’s an admittedly type heavy scheme:

abstract type StringLocation end
abstract type StringIndex <: StringLocation end

struct CharIndex <: StringIndex
    ind :: Int # code unit index
end

struct GraphemeIndex <: StringIndex
    ind :: Int # code unit index
end

struct CharLocation <: StringLocation
    ind :: Int # code unit index
    off :: Int # character offset
end

struct GraphemeLocation <: StringLocation
    ind :: Int # code unit index
    off :: Int # grapheme offset
end

Basic idea:

  • codeunits(s) wraps a string, iterates code units, has integer indexing
  • chars(s) wraps a string, iterates characters, has CharIndex indices
  • graphemes(s) wraps a string, iterates characters, has GraphemeIndex indices

If you do CharIndex + Integer you get a CharLocation compound location; similarly if you do GraphemeIndex + Integer you get a GraphemeLocation compound location. Why? This forces the user to be explicit about how they want to interact with a string—as a collection of code units, character, or graphemes. It also makes the API uniform: code that does e[1:i-1], e[i+1:end] like the example above can be changed from working with code units, chars, or graphemes simply by changing the string wrapper and nothing else. To me this is reminiscent of the matrix factorization design that let you just replace a matrix by its factorization object and leave the rest of the code unchanged.

I didn’t remember/realize that you had already proposed this exact idea. Funny that it wasn’t appealing at the time; now I can’t see what the performance issue would be.

+1 for thinking about this. I think it’s a good idea. Not necessarily the exact current implementation suggestion. But, really, drop the 2.0 reference - it never does any good for the discussion.

1 Like

The compiler is indeed typically smart enough to delete that reference. One alternative though, which might be an okay middle ground would be to capture the objectid of the string rather than the string itself.

This could cause incorrect collisions, but they’re pretty unlikely and the id would only be used for trying to identify and throw and error anyways. Just an idea.

It would be a breaking change, so it would have to be a 2.0 change.

Would it be possible to have a second string type (IndexedString, or a better name), with its coupled StringIndex, implementing this idea? If these are new types, the change is not breaking.

It’s also just the removal of the getindex(::String, ::Int) that’s most breaking. Perhaps changing the return types of index-returning functions would be too breaking to do, but it’s the kind of thing that might just be able to be supported as a minor change… especially if it could still be a <: Integer.

Or you could just add getindex(::IndexedString, ::StringIndex) for a new String type, leaving the current legacy getindex(::String, ::Int) as it is now.

It is always possible to deprecate an undesirable method, which would be better than just leaving it be.