StringIndex idea (Julia 2.0)

That part can be fixed. Simplest way:

struct StringIndex
    unit_index  :: Int32 # code unit index
    char_offset :: Int32 # character offset from there
end

It takes the same space as one Int index, i.e. Int64 on 64-bit (on 32-bit it would still be double the size of Int, but who cares about such legacy platforms).

The usual index is Int, i.e. signed, and I can hear the complaint that you then can't index all of a truly huge string. You can get one more bit with UInt32, enough to index 4 GB strings, but something more clever can also be done.

How large does char_offset really need to be? Int16 (or even just Int8)? You can have a single UInt64 type that fits into one CPU register: the last 16 bits (i.e. AX on x86) hold the Int16 char_offset, and the remaining 48 bits, recovered by shifting right by 16, hold the unit_index, so you can support strings of up to 2^48 bytes, i.e. 256 terabytes.
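A minimal sketch of that packing, under the stated assumptions (PackedIndex, unit_index, and char_offset are illustrative names, not anything in Base):

```julia
# Hypothetical sketch of the packed index described above: 48 bits of
# code-unit index plus a signed 16-bit character offset in one UInt64.
struct PackedIndex
    bits::UInt64
end

# Pack a code-unit index (< 2^48) and an Int16 character offset.
PackedIndex(unit_index::Integer, char_offset::Integer) =
    PackedIndex((UInt64(unit_index) << 16) |
                (reinterpret(UInt16, Int16(char_offset)) % UInt64))

unit_index(p::PackedIndex)  = Int(p.bits >> 16)   # shift out the low 16 bits
char_offset(p::PackedIndex) = Int(reinterpret(Int16, UInt16(p.bits & 0xffff)))

p = PackedIndex(4, -1)
unit_index(p)   # 4
char_offset(p)  # -1
```

The whole thing fits in one register, and `sizeof(PackedIndex) == 8`, the same as a plain Int index.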

I have to think about this a bit more, I think this can be done in 1.x.

Why forward? The unit_index is always >= byte_index(?)

I’m confused, why would you do that? Is it when working with two strings side-by-side?

julia> "Páll"[4]  # index of the first 'l' in my name. You can index 2, but then not 3, so I don't see how an arbitrary index like 4, or 3, is useful for some other string, *in general*
'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)

(On topic?) I do wonder if getindex(::String, ::Int) could be made to work on codepoints in \mathcal{O}(\log n) time using a skip-list.

Regarding fast by-character indexing of strings:

If this is super important, then I guess the way I’d attempt to do it in a non-breaking way is the following:

  • On the API side, there needs to be some marker to distinguish the current by-address scheme, and whatever by-character scheme that we want to support
  • The String type gets slightly more complex: Small strings stay as-is, large strings get an extra 8 bytes added to their size (presumably at the end).
  • When you use the new by-char indexing on a small string, then we eat the O(n) cost. It’s a small string so whatever.
  • When the string is large, then we need a Python 3-style oversized char array somewhere. Large strings have the extra 8 bytes at the end of their allocation, so we lazily initialize that char array on by-character access.
  • Depending on quality of gc support, we can also occasionally clear that cached lazily-initialized extra character array, jvm SoftRef style.

The end effect would be: People with a clue about UTF-8 who eschew the new API incur very small overhead (large strings eat an extra 8 bytes, and GC needs to do some minor extra work for strings). People who use the new API get relatively fast random access on a by-character basis with relatively small overhead: the memory overhead over the Python solution is at worst 2x (a large string containing 2-byte characters as well as a single 1-byte character, so we store both UTF-8 and UTF-16 representations). The time overhead over the Python solution is one additional memory indirection, or one small-string O(n) search (if "large string" means >64 bytes, then this is not even a loop on AVX-512).
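A rough sketch of the lazily initialized by-character cache, shown here as a standalone wrapper type since Base's String layout can't be extended from outside (CachedString and charindex are made-up names):

```julia
# Sketch: cache the byte index of every character, built lazily on the
# first by-character access, so later accesses are O(1).
mutable struct CachedString
    str::String
    offsets::Union{Nothing,Vector{Int}}  # byte index of each char, or not yet built
end
CachedString(s::AbstractString) = CachedString(String(s), nothing)

function charindex(cs::CachedString, i::Int)
    if cs.offsets === nothing
        cs.offsets = collect(eachindex(cs.str))  # one O(n) scan, paid once
    end
    return cs.str[cs.offsets[i]]                 # O(1) from here on
end

cs = CachedString("Páll")
charindex(cs, 2)  # 'á'
charindex(cs, 4)  # 'l'
```

A GC hook could drop `offsets` under memory pressure, as with the SoftRef idea above.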

Instead of storing a reference to the parent string, could Strings carry a UInt64 hash of their contents and StringIndex could inherit that? Then it could be used to index into identical copies of a string.

That, i.e. storing UTF-16, doesn't work except when the string only has 2-byte (or 1-byte) characters. Python actually uses fixed-length UCS-2 (almost UTF-16), and has the third option UTF-32, which is overkill, for full strings. Note that UTF-16 can be used for all strings, but then it isn't O(1), because there can be surrogates.

Ideally we want O(1) indexing including for e.g. Chinese, and I thought up such a scheme, way back, with just UTF-8 (plus small overhead):

What you can do is have waypoints to, say, every 256th char (or every 16th, depending on how much you want to reduce the memory overhead); it's similar to your redundant representation, except you can't get a pointer to the UTF-16. Which I admit might be helpful for Windows…

Most often you iterate through the string, as Karpinski mentioned, from the front, or sometimes the back. So I thought that instead of the lazily built-up waypoints, you could have a sliding window into the string in UTF-16 or UTF-32, say one cache line's worth (or maybe 256 bytes?).

I think cache-lines are now 64 bytes, but to be conservative assume only 32 bytes:

Thereof, an Int64 (or, as I argued, Int32, Int48, whatever) byte_index and a same-sized unit_index; the remaining 16 bytes are 4 UTF-32 letters.

If we get away with Int32 or even Int16 for the indexes (negative could be indexing from the end), then we have 24 or 28 bytes for letters, or 56 or 60 if we can rely on 64-byte cache-lines.

When I think about it we can skip the window, and just read from the original UTF-8 data.

It's slower to populate it (this could be skipped for indexing below e.g. 32, as a fast path, still O(1)), but then the next few accesses can be faster because the window is not variable-length (at least with UTF-32; with UTF-16, a flag could state whether it contains surrogates).

My original idea was lazily spaced-out waypoints, but they could also be e.g. logarithmic, just the amount you can fit into one cache line. If you store waypoints in any way, then the unit_indexes are implied and you only store byte-indexes.
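For concreteness, here is one way the waypoint scheme could look; a small K is used below so the example is visible, and waypoints and char_at are illustrative names:

```julia
# Record the byte index of every K-th character; resolving character n
# then needs at most K-1 nextind steps, i.e. O(1) for fixed K, at a
# memory cost of one Int per K characters.
function waypoints(s::String, K::Int)
    w = Int[]
    for (n, b) in enumerate(eachindex(s))
        (n - 1) % K == 0 && push!(w, b)   # byte index of chars 1, K+1, 2K+1, …
    end
    return w
end

function char_at(s::String, w::Vector{Int}, K::Int, n::Int)
    b = w[(n - 1) ÷ K + 1]                # nearest waypoint at or before char n
    for _ in 1:((n - 1) % K)
        b = nextind(s, b)                 # walk at most K-1 code points
    end
    return s[b]
end

s = "αβγδε"                               # 2 bytes per character
w = waypoints(s, 2)                       # [1, 5, 9]
char_at(s, w, 2, 4)                       # 'δ'
```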


I don’t think we ever want to support this. None of the core devs are suggesting moving away from UTF8.

The question, as I understand it, is whether to:

  1. Make s[i] do slow character indexing when i::Int, primarily to make strings more intuitive for new users. Serious string code would eschew this, in favor of…

  2. A new opaque StringIndex type (or multiple types) that represent an opaque O(1) index equivalent to the current indices. (e.g. eachindex, findfirst, etc would return this).

  3. One of the new StringIndex types could include a character offset, so that you could do e.g. i + 1 rather than nextind and it would be reasonably fast to use as long as the offset is small.

(1) would be massively breaking, so this couldn’t occur until a Julia 2.0 in the distant future, so no one should get too excited about this right now.

Personally I’m not sure that making strings slightly more intuitive for your first few days using Julia, with not much benefit for any serious string-processing code, is worth the enormous upheaval this change would cause, but in the halcyon utopia of Julia 2.0 perhaps such mortal matters will no longer concern us.


One option that wouldn’t be breaking would be to introduce another syntax for character-based indexing and then make the indexing error message suggest this.

For example, we could make strings callable so that s(i) is a slow character-based index. The fact that it looks different from getindex should be a hint that it is different from an O(1) lookup.

(It could even return the i-th grapheme, which has the downside of being a string rather than a character, but people coming from Python may not notice. :wink: Despite the conceptual appeal of graphemes, however, it would probably be even more confusing if s(i) didn’t match the elements of collect. Users in their first two days of Julia, who this is really for, won’t know what a grapheme is.)
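A sketch of what the callable-string idea could look like. It is shown as a method added to Base's String purely for illustration (doing this from a package would be type piracy; in Base it would just be defined directly):

```julia
# Make strings callable: s(i) is the i-th character, found by an O(i)
# scan with nextind, deliberately distinct in spelling from O(1) s[i].
(s::String)(i::Integer) = s[nextind(s, 0, i)]

"Páll"(2)  # 'á', even though s[3] is not a valid byte index
"Páll"(4)  # 'l'
```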


I wonder if many of the new-to-Julia challenges could be avoided (or at least placated) with a well-written error hint for StringIndexErrors. The trouble is that you don’t have much space and whatever you suggest will surely be copied and pasted into all sorts of places. Whether you suggest replacing s[i] with s[nextind(s, 0, i)] or collect(s)[i], both will likely end up verbatim inside a for loop.


I’d rather not see indexing into strings messed with. A String is fundamentally a byte array, and indexing it by bytes is therefore the correct default. Julia doesn’t guarantee that the encoding in a String is correct, for one thing, and low-level String code wants to work with the bytes and cope with the encoding directly, for performance reasons (I’m working on a package which does this, which colors my opinion here).

The problems with naively assuming ASCII and getting wrong code are real, and I agree that they stem from the fact that the right way to do things is awkward. When i + 1 works for ordinary text, it’s all too easy to fall into the habit of using it; string[i:i+5] is much easier to write than string[i:nextind(string,i,5)] (I’ll admit I had to look up that syntax to make sure it was correct!).

What we could do, though, is have standard views into a string. So when you make a String, it’s byte indexed, and nothing about how Strings work changes. But you can make a CharString as well, and this is just a wrapped string, but charstring[3:5] gets you a string which is the third through fifth characters (codepoints really), not bytes. It's easy enough to convert back to a String: just unwrap it. Converting an existing String to a CharString is also cheap; immutable strings mean they can share references.

A nice property of this is that one can also make a GraphemeString when that’s actually what’s needed. Graphemes are inherently “expensive” because of the local information needed to determine boundaries, but a workflow where, say, a string is broken up into lines (cheap, byte iteration works fine), then @views (or something like it) is used to turn each line into a GraphemeSubString, then further processing can be done with confidence that you won’t accidentally break “:point_up:t4:” into pieces. This also allows for exotics like a TextWidthString, which would be a boon for doing terminal work. And there’s a single source-of-truth data structure unifying all of them, the String, indices remain what they are, Ints, we’re just using a wrapper struct to get the kind of indexing we need.

This has the additional advantage that it’s not a breaking change, it’s something which could be added right now.
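A minimal sketch of such a CharString wrapper, with character indexing implemented via nextind (so getindex is O(i) here; caching strategies are a separate question):

```julia
# Hypothetical CharString view: same bytes as the wrapped String, but
# integer indices count characters (codepoints), not bytes.
struct CharString <: AbstractString
    data::String
end

# Forward the minimal AbstractString interface to the wrapped String.
Base.ncodeunits(s::CharString)           = ncodeunits(s.data)
Base.codeunit(s::CharString)             = codeunit(s.data)
Base.codeunit(s::CharString, i::Integer) = codeunit(s.data, i)
Base.isvalid(s::CharString, i::Integer)  = isvalid(s.data, i)
Base.iterate(s::CharString)              = iterate(s.data)
Base.iterate(s::CharString, st)          = iterate(s.data, st)

# Character-based indexing via nextind: O(i), not O(1).
Base.getindex(s::CharString, i::Int) = s.data[nextind(s.data, 0, i)]
Base.getindex(s::CharString, r::UnitRange{Int}) =
    CharString(s.data[nextind(s.data, 0, first(r)):nextind(s.data, 0, last(r))])

cs = CharString("Páll")
cs[2]          # 'á'
cs[2:3].data   # "ál"
```

Unwrapping (`cs.data`) recovers the original String without copying, since both share the same immutable bytes.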


I wanted to follow up on this a bit. The obvious disadvantage of working with strings properly is that many operations go from O(1) to O(n). This isn’t so bad by itself, but its consequence is: many string algorithms of interest (most?) are O(n) given O(1) indexing, so they become O(n²), and that is bad news.

But we can recover that performance very cheaply, by having a CharString (or any of the varieties) cache the last index’s offset, so charstr[i:i + 5] isn’t scanning from the head of the string twice. If this is the first time we’ve indexed the string, it finds i by scanning from the start, but then for i + 5 the string view ‘knows’ the byte offset of i, so it does an internal nextind(str, i, 5).

This would mean that naive iteration using an index range would be as performant as a tuned iterator, and would shave a lot of time off most reasonable things we do with strings. Access is almost never truly random, so just keeping track of the last index point would be quite sufficient for most realistic string doings.

For very long strings, there’s a further optimization available, which is to lazily build a binary map of offsets. sizeof is already known because Julia strings (thank goodness) don’t rely on null termination, so we make an SVector with (example) three slots, with the byte offsets pre-registered as 1/4, 1/2, and 3/4 of the way down the string. Since we’re turning indexing into a scan, we fill in the user-visible index the first time we reach one of these offsets. I’m not even convinced this is worth it, just the last-index cache would give fairly good performance, but this guarantees an amortized O(log n) indexing, and you can’t beat that without a specialized data structure, one which either takes up substantially more space or makes the string incomprehensible to other languages, such as the C interface.
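A sketch of just the last-index cache (CachedCharString and charindex! are illustrative names; the lazily built offset map is omitted):

```julia
# Remember the byte position of the last accessed character, so that
# forward-moving access patterns like str[i], then str[i+5], resume
# the scan instead of restarting from the head of the string.
mutable struct CachedCharString
    data::String
    last_char::Int   # character number of the cached position
    last_byte::Int   # its byte (code-unit) index
end
CachedCharString(s::AbstractString) = CachedCharString(String(s), 1, 1)

function charindex!(cs::CachedCharString, n::Int)
    if n >= cs.last_char
        b = nextind(cs.data, cs.last_byte, n - cs.last_char)  # resume forward scan
    else
        b = nextind(cs.data, 0, n)                            # restart from the front
    end
    cs.last_char, cs.last_byte = n, b
    return cs.data[b]
end

cs = CachedCharString("αβγδε")
charindex!(cs, 3)  # 'γ' (scans from the front)
charindex!(cs, 4)  # 'δ' (one nextind step from the cached position)
```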

It might also be worthwhile to have index types which correspond to these categories, so a CharIndex for CharString and so on. There are some user-experience questions to answer about how the user gets this sort of index, when they’re returned, and so on, but these can be solved.

I brought them up to say that, for such an index, the .string and .offset fields should be treated as a cache. GraphemeIndex(10) should mean “the tenth grapheme of any string”. If you make GraphemeIndex(mystr, 10) it will give O(1) indexing to that point of the string, but if you use it to index any string, including not-grapheme views (or just vanilla String), it will return the 10th grapheme, if it’s in-bounds of course. This means we can do complex stuff like column alignment without having to do various ceremonies to “unwrap” e.g. a TextWidthIndex to use it on the next row. It has the further nice property that indices of any view type may be freely mixed with strings of any view type: this will always return the same character in all cases, just at varying degrees of efficiency.


What about multithreading?

Read-only access to most data structures should be thread-safe (go away, splay trees). This especially applies to String, which is immutable anyway!

Lazy cached values are OK in that context, e.g. caching the hash of strings if we decide to spend the 8 bytes for that, because the common case will involve no atomics or contended cache lines (see e.g. Java's String). But the last accessed index will update all the time!

I disagree with your premise that random codepoint indexing is “working with strings properly”. String algorithms that involve random access should be operating on codeunits (= bytes in UTF-8), not codepoints (Chars), and codeunits already provide O(1) access.

So far, in all the discussions of strings in Julia, to my recollection there has not been a single example of a practical string algorithm that requires random-access codepoint indexing (as opposed to a “pointer” to a previously traversed location, as in a search result).

The discussions on this thread about making the current string (codeunit) indices opaque and adding other index types have mostly been about making things more intuitive for new users, and preventing bugs due to mis-use of s[i+1] instead of s[nextind(s, i)] and similar. Not about an algorithmic need for fast random character indexing.

(AFAIK, no mainstream programming language other than Python3 tries to guarantee O(1) random codepoint access for its default string type, and Python3’s idiosyncratic approach comes with a lot of disadvantages.)


I’m not sure what you’re disagreeing with, given that I said this:

But this also means a lot of algorithms which “work on my machine”, because they’re tested against the ASCII range of UTF-8 only. So someone writes a naïve slice of a string for five characters, and someone uses it on Greek and gets an invalid string. That can’t really be considered working with strings properly, can it?

The lede you quoted was an intro to some thoughts on how to get good performance while working with codepoints as though they were randomly accessible. At no point did I suggest blowing strings up into UTF-32, and I agree with you that it’s a bad language default.

I don’t think this is a problem, because it can’t lead to incorrect results even in pessimal circumstances. If several threads are working on the same string on different ends, yes, the ‘cache latest access’ optimization will degrade, although the binary-map optimization wouldn’t. But it’s never going to give the wrong result, and getindex can use some pretty cheap math to decide whether to start from the front, the end, or the latest i. That is itself heuristic; in some exceptional circumstances the algorithm might pick the second-best starting point, and all of this is a good argument for some kind of binary map that kicks in for long strings (for some value of long).

It would be better for code where multiple threads are working on the same string to use the proposed CharIndex type, which carries its own byte offset. That kind of access could update the index cache, or not, either way it wouldn’t matter because getindex(::AbstractString, ::CharIndex) wouldn’t use it. But if someone writes multithreaded code which just uses naïve indexing into a CharView string, it won’t mysteriously break, at worst it will be slower than expected. That’s acceptable, because it will always be faster than not maintaining the cache, and thus starting from one end or the other every time the string is indexed.

Continuing to think about how all of this would work together, and I’m convinced that a collection of AbstractStringIndex types would be essential if the AbstractStringView <: AbstractString approach were to be chosen. It’s not clear to me if AbstractStringIndex should have a numeric supertype or not, I lean toward yes.

As a motivating example, consider RegexMatch. To prevent breaking changes, nothing about a RegexMatch on existing Base string types can change, and nothing would have to.

But let’s say the user runs a regex on a GraphemeView, what should the values of .offset and .offsets be? If they’re Int, we’ve created a problem, because using them to slice or access the GraphemeView will give the wrong result. This could be corrected by counting graphemes to the byte offset and providing those numbers, but this is incredibly wasteful and slow.

So they would need to be a CodeunitIndex, which is merely an Int in disguise, one that getindex and friends on a GraphemeView knows to use on the underlying byte vector, not translate into a grapheme index.
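To make the problem concrete, on today's String the offset in a RegexMatch is already a code-unit (byte) index, which is exactly the value a hypothetical CodeunitIndex would wrap:

```julia
# On a plain String, RegexMatch.offset is a byte index, not a
# character number: the first 'l' here is the 3rd character but
# starts at byte 4, because 'á' occupies bytes 2 and 3.
m = match(r"l", "Páll")
m.offset         # 4
length("Pál")    # 3
```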

One of the strengths of this proposal is that it’s a non-breaking change, so String operations which return an Int have to continue returning an Int. To accommodate the additional complexity, and provide a path forward, define a StringIndex type, which is Union{Int,AbstractStringIndex} (perhaps Union{Integer,AbstractStringIndex}), which methods can use as a drop-in replacement that is valid for more types.

Methods which have AbstractString and Int as an index in the signature should already behave correctly, since the premise here is that these new StringView types take an Int and interpret it according to the view. This is no different from packages which offer, for example, UTF-16 strings, where an Int is already not a reference into a byte vector (usually these are backed by a UInt16 vector). Such methods could simply change the Int to StringIndex to work with all the new index types; that’s normal with backward-compatible changes: code does have to be modified to work correctly with new features.

There are some remaining wrinkles about how to convert index types into one another, but this is getting long already.

s[1:5] is buggy code for an arbitrary s, certainly. But it seems more like a programmer/coding problem than a language one.

“Give me the first five codepoints of a string” is possible to do properly in Julia right now, via s[1:nextind(s,1,4)]. Admittedly, that syntax is a little awkward, and if this were an important/common operation for arbitrary Unicode strings then it would be good to have a nicer syntax. But is it an important or common operation?

I would argue that any programmer who thinks “give me the first five codepoints” is a good generic operation to perform on an arbitrary string (as opposed to “give me all the characters before the first whitespace” or similar operations resulting from a string-specific query) is probably misunderstanding how Unicode works.

(This is what happens with almost every discussion here about character indexing in Julia. Someone comes along and says “I want to do x and it doesn’t work”, and it usually turns out that x is not a practical task on Unicode strings to begin with. At least with Julia’s indexing, programmer mistakes like this throw an error the first time they encounter a Unicode string, rather than silently lurking until they encounter a combining character.)

The practical cases where an opaque StringIndex type would help, in my opinion, are things like “I searched for a regex and now I want the next/previous character”, where it is tempting to do a buggy ± 1 rather than nextind/prevind.
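That pattern, spelled out with a concrete string:

```julia
# After a search, the returned index is a byte index; ±1 is buggy on
# multi-byte characters, while nextind/prevind is correct.
s = "Páll"
m = match(r"á", s)
i = m.offset           # 2, a byte index
s[nextind(s, i)]       # 'l', the next character
# s[i + 1] would throw StringIndexError: byte 3 is the middle of 'á'
```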

should be s[1:nextind(s,1,4)] perhaps?
That this needed a correction should probably be a good argument…


Right, fixed the typo. Again, if this were an important/common operation, I agree that a nicer syntax would be helpful.

In fact, we do have

using Unicode
graphemes(s, 1:5)

which is a lot nicer and is probably closer to what people want when they say “give me the first five characters”. (It still doesn’t seem to have much practical utility, but it least it lets us give a compact answer to such queries.)


The one situation where I’ve come across this in my own packages is for pretty-printing tabular data including math-unicode to stdout. That is, the header including ∫gₐ(t)dt in the tabular output shown at the bottom of Optimization of a Dissipative Quantum Gate · Examples. I haven’t had to do anything with graphemes in the Julia version of this code, since I haven’t gotten around to making that output customizable. But as soon as I bring that feature over from the Python version, I’ll have to do string processing based on graphemes there as well to make sure everything lines up neatly in columns.

All that being said, while I think Julia’s string handling takes a little getting used to, it is a fundamentally sound design, and I would resist changing it in Julia 2.0. If you need to process strings in terms of user-perceived characters, that’s exactly what graphemes are for. And graphemes are certainly not a typical use case of string processing.


That’s still wrong, because not every grapheme has the same width (even in “monospaced” fonts). So the Python code you are linking is buggy.

You really want to do string processing based on textwidth. (And even that is imperfect, because not all terminals agree on the widths of every Unicode symbol.) Even then, this doesn’t involve random access.
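For example, a single grapheme can occupy one or two terminal columns, which is why counting graphemes isn't enough for column alignment:

```julia
# One grapheme each, but different display widths.
textwidth("A\u0302")  # 1: 'A' plus a zero-width combining circumflex
textwidth("世")       # 2: an East Asian wide character
```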

But I agree that the _rjust function in the code that you linked is a useful operation and tricky to implement correctly. Fortunately, Julia already has just such a function, done (more) correctly: the lpad function:

julia> "A\u0302"
"Â"

julia> lpad("A\u0302", 2)
" Â"

PS. To be clear, I’m not saying “never use graphemes”. Graphemes are important, e.g. for cursor motion. But “give me the m:n graphemes of an arbitrary string” random-access pattern seems to rarely if ever come up in real applications, as opposed to answers to forum posts.
