That part can be fixed. Simplest way:
struct StringIndex
    unit_index::Int32   # code unit index
    char_offset::Int32  # character offset from there
end
It takes the same space as one Int index on 64-bit, i.e. 8 bytes, the size of an Int64 (and ok, on 32-bit it’s still double the size of an Int, but who cares about such legacy).
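To make that concrete, here is a minimal sketch of the struct plus a way to turn it back into a plain index. The `resolve` helper is hypothetical (not anything in Base), and it assumes the offset counts characters forward from `unit_index`:

```julia
struct StringIndex
    unit_index::Int32   # code unit index
    char_offset::Int32  # character offset from there
end

# Hypothetical helper: step char_offset characters forward from unit_index
# and return an ordinary code unit index usable with s[...].
resolve(s::AbstractString, i::StringIndex) =
    nextind(s, Int(i.unit_index), Int(i.char_offset))

sizeof(StringIndex)                 # 8, two Int32 fields

s = "Páll"
i = resolve(s, StringIndex(1, 2))   # two characters past code unit 1
s[i]                                # 'l'
```

`nextind(s, i, n)` is the existing Base function doing the stepping, so the struct itself stays a plain 8-byte value.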
The usual index is an Int, i.e. Signed, and I hear you complaining that with Int32 you can’t index your full string when it is huge (over 2 GB). You can get one more bit with UInt32 to index 4 GB strings, but something more clever can also be done.
How large does char_offset really need to be? Int16 (or even just Int8)? You can have a single UInt64 that fits into one CPU register, where the low 16 bits (i.e. AX on x86) are the Int16 char_offset and the remaining 64-16 = 48 bits are the code unit index, recovered by shifting right by 16. That supports strings of up to 2^48 bytes, i.e. 256 terabytes.
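A rough sketch of that packing, assuming the code unit index sits in the high 48 bits and a signed 16-bit offset in the low 16. The names `PackedStringIndex`, `unitindex` and `charoffset` are made up for illustration:

```julia
const OFFSET_BITS = 16

# Hypothetical packed index: one UInt64, so it fits in a single register.
struct PackedStringIndex
    bits::UInt64
end

function PackedStringIndex(unit_index::Integer, char_offset::Integer)
    0 <= unit_index < (Int64(1) << 48) ||
        throw(ArgumentError("unit_index must fit in 48 bits"))
    typemin(Int16) <= char_offset <= typemax(Int16) ||
        throw(ArgumentError("char_offset must fit in an Int16"))
    # Keep the two's-complement bit pattern of the offset in the low 16 bits.
    lo = UInt64(reinterpret(UInt16, Int16(char_offset)))
    return PackedStringIndex((UInt64(unit_index) << OFFSET_BITS) | lo)
end

# Shift right by 16 to get the code unit index back.
unitindex(p::PackedStringIndex) = Int64(p.bits >> OFFSET_BITS)

# Truncate to the low 16 bits and read them as a signed offset.
charoffset(p::PackedStringIndex) = reinterpret(Int16, p.bits % UInt16)

p = PackedStringIndex(Int64(2)^40, -3)
unitindex(p)    # 1099511627776
charoffset(p)   # -3
```

Whether the offset needs to be signed at all is an open question here; with an unsigned offset the `reinterpret` calls would simply go away.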
I have to think about this a bit more, but I think this can be done in 1.x.
Why forward? The unit_index is always >= byte_index(?)
I’m confused, why would you do that? Is it when working with two strings side-by-side?
julia> "Páll"[4] # index of the first l in my name. You can index 2, but then not 3, so I don't see how an arbitrary index 4, or 3, is useful for some other string, *in general*
'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
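For reference, this is how the valid indices of that string look, using only functions from Base. The á takes two code units in UTF-8, so 3 falls in the middle of a character:

```julia
s = "Páll"

ncodeunits(s)           # 5
collect(eachindex(s))   # [1, 2, 4, 5]
isvalid(s, 2)           # true
isvalid(s, 3)           # false
nextind(s, 2)           # 4

# Indexing at 3 throws instead of returning a character.
err = try
    s[3]
catch e
    e
end
err isa StringIndexError   # true
```

So an index like 4 only means "the first l" for this particular string; the same number in a different string can land on another character or on no character at all.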