Itβs possible, but it might still not be fastest if you want to index to a Char (or a grapheme cluster).
The Char type is 64-bit (built on UInt64, e.g. CPU register), but in Unicode a Char is guaranteed to never be larger than 21-bit, so if you want to hack something you could fit 3 Char
s into each UInt64. In very many cases you are dealing with BMP (UCS-2), then you can fit 4 into a UInt64 (what Python does in such a case, it has currently three different encodings in memory, Latin1, that, and UTF-32, what youβre asking for, just in case, and was least useful until recently when emojis got popularβ¦).
IF you know your UTF-8 string (all of its Char
s) is actually an ASCII string (itβs a subset of UTF-8), then you could index into the String
, just as it is (to a byte), then there will be no invalided indices. You would only have to check once, or not if for some reason you just know this reliably. If itβs very probable but not always ASCII you could keep a flag (is_ASCII) and use byte indexing if true, O(1), otherwise suffer O(n), rarely hopefully, and no copies.
It would be nice if Julia strings actually had such a flag, or is_potentially_NOT_ASCII, that could be automatically set to true until proven otherwise. Note Julia strings are NOT guaranteed to be even UTF-8! Thatβs the encoding Julia assumes, but when you read from a file it could be invalid UTF-8 too. You can iterate over it, and itβs your responsibility to validate it (thereβs a Julia function to do that, after input, but the result isnβt stored after, in a potentially useful to have flag is_potentially_INVALID_UTF-8).
I (and others) didnβt think indexing to a Char useful enough, but there is an O(1) way for UTF-8 strings. You could make another auxiliary vector of byte-indexes, then you can O(1) index into that one, likely get away with storing the index there as UInt32, and then to index to the string you had. It will save memory in all cases over your original plan but you need to index twice.
If you think the memory use is too much for such an index, then itβs viable to store in it say the index of every 64th letter in your string. Then you have 1 bit overhead per letter in your original String, and you index to a cache-line of information. Yes, you need to scan max 64 letters after, but itβs still O(1), since a constant amount. And you can skip the 3 in one UInt64 Char idea. You could lazily store this auxiliary index, even enlarge it lazily.
If you want to index to a grapheme cluster (e.g. [flag] emojis), since they can be very long, the above idea is the most practical.