You should use the first byte of the pointer to signal the short form, since (at least for other uses, is an upcoming standard) it’s free (or can be):
Top Byte Ignore (TBI) is a feature of Armv8-a AArch64 that allows software to use unused pointer bits to store data without having to hide them from the hardware.
[This likely only works for loads and stores, but trivial to do for compares, or all other operations, of pointers, if/when not already done in hardware. For the GC you do not want TBI, i.e. ignore and load, rather checking, and only load when zeros.]
You can make an implementation that always uses the pointer, basically pointing to the old/current String type, and then it works as a proof-of-concept for the prefix (would still be a very useful package, just not ideal), and allows fast ordering, and all operations except for constructing the strings, and doesn’t lighten the load of the GC.
Though for ignoring the pointer, by the GC, when the top-eight bits aren’t zero, seems like a trivial change. It needs to be added to the GC, and for this string type for reusing/reinterpreting the pointer for a (part of the) prefix for the short form.
When all 16 bytes are zero then it’s the empty string, and otherwise the length of the prefix that non-zero gives the length of the string, unless the first byte is zero, and not all others, i.e. the long-form used (it would also be used for some short strings exceptionally).
There are many ways, from simplest 37% savings with 5 bits per ASCII letters (and 6 entries left over for punctuation or numbers, upper handled differently; possibly even 4 bits for only the most common letters…), to 2x with two ASCII letters compressed into a byte, using a lookup table, seemingly even 3 letters possible almost as easily in each byte, for 2.67 bits bet letter. I think fixed number of letters might be best, though needs not be, could be variable-length, i.e. BPE tokenization as in LLMs.
Compressing is a trade-off, it’s still O(1), and fast, I think fast enough, but I think still a speed-win in all cases (for real-world code), since it avoids a heap allocation and GC activity. Might be slower in some real cases if you only construct strings, and don’t do any processing on them, though likely not even then with given some GC activity.
You would need an escape hatch for an UTF-8 encoded prefix, also because you could just use that if (ASCII string) fitting 16 bytes, only compress for strings with a length of 17-32 letter etc. Even with the (ASCII) compression, I would likely have the first letter in UTF-8, to allow upper-case (alternatively a single bit to indicate it), and only the rest compressed (fully upper case isn’t common enough for 17-32 letter strings).
Despite this complexity, string equality check is just checking equality of UInt128 numbers, and e.g. isless
(and startswith) is almost as simple, just shifting and some branch code, at least for the common case.
Concatenation isn’t slower assuming the first string is long enough, but even then not too bad.
But it’s not a real-world string, so I think we can just use the long form for all the problematic short strings. I do have in mind that you can use Strings for data, not text, so I might be wrong, at least using the long form would work, no less than the current design, just not as fast as the this design for all other short strings.
For the short form, you count the trailing zero bytes (or bits for compressed form; for 2x (or 3x) compression this is more difficult, then you may want to support a special case for odd number of letters).
UTF-8 can’t legally start with e.g. 0xFF, so it could let it indicate compressed strings (and indicate illegal UTF-8 with the long form). Or even just let 5 leading one bits indicate it, then left with at least 24 ASCII letters compressed. [For compressing everything into 8 bytes, this would give 11 plus letters, and only 8-16 bits taken from the prefix in the long form from the pointer, then length needs to be on the heap.]
Your prefix for long form overlapped your first letters of the data in the short form. I agree you should also have a (shorter) prefix for long form, just saying that the pointer (well its first byte) should align with the first bytes of the data in short form, because strings starting with a zero byte are rather uncommon (except for the empty string, if zero terminated, which is supported).