I took a look at it, looks nice.
Note: I might recommend putting the 4 bits of length at the top of the 16-byte structure.
That might make life a bit easier if you later want to use SIMD instructions that operate on 16-byte chunks,
or work on 8 bytes at a time (because it keeps the bytes aligned). I also think it makes it easier to store 7 16-bit characters or 3 32-bit characters: everything stays aligned, and less shifting and masking is necessary (i.e. len = x >>> 124 gets the number of code units out of a UInt128).
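To make the layout concrete, here is a minimal sketch of what I have in mind, assuming the length lives in bits 124-127 and up to 15 bytes of data fill the low bits. The names pack_bytes, short_length and get_byte are made up for illustration and are not from your package.

```julia
# Hypothetical layout: bits 124-127 hold the length,
# bits 0-119 hold up to 15 bytes, byte i (1-based) at bits 8(i-1) .. 8i-1.
const LEN_SHIFT = 124

function pack_bytes(bytes::AbstractVector{UInt8})
    n = length(bytes)
    n <= 15 || throw(ArgumentError("at most 15 bytes fit next to a 4-bit length"))
    x = UInt128(n) << LEN_SHIFT
    for (i, b) in enumerate(bytes)
        x |= UInt128(b) << (8 * (i - 1))
    end
    return x
end

# The length sits in the top bits, so one shift is enough and no mask is needed.
short_length(x::UInt128) = Int(x >>> LEN_SHIFT)

# Every byte starts on a byte boundary, so extraction is a shift and a truncation.
get_byte(x::UInt128, i::Integer) = (x >>> (8 * (i - 1))) % UInt8

x = pack_bytes(codeunits("hello"))
short_length(x)   # 5
```

With this arrangement the data never shares a byte with the length, which is what keeps the 8-byte and 16-byte views aligned.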
You could also use 4 more bits to indicate things like whether the string is valid, or whether it is binary data, ASCII only, Latin1 only, UTF-8, UCS2, UTF-16, or UTF-32.
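As a rough sketch of how those extra bits could be used: the split below (3 bits for the kind of data in bits 120-122, 1 bit for "validated" in bit 123) is just one assumption of mine, and the StrKind enum and the helper names are hypothetical.

```julia
# Hypothetical assignment of the 4 bits just below the length:
# bits 120-122 = kind of data, bit 123 = "has been validated".
@enum StrKind::UInt8 KindBinary=0 KindASCII=1 KindLatin1=2 KindUTF8=3 KindUCS2=4 KindUTF16=5 KindUTF32=6

const KIND_SHIFT  = 120
const VALID_SHIFT = 123

function set_meta(x::UInt128, kind::StrKind, valid::Bool)
    cleared = x & ~(UInt128(0xf) << KIND_SHIFT)      # clear bits 120-123
    return cleared |
           (UInt128(UInt8(kind)) << KIND_SHIFT) |
           (UInt128(valid) << VALID_SHIFT)
end

# Reading a field back is one shift and one small mask.
str_kind(x::UInt128)    = StrKind((x >>> KIND_SHIFT) & 0x7)
isvalidated(x::UInt128) = ((x >>> VALID_SHIFT) & 0x1) == 1

x = set_meta(UInt128(5) << 124, KindUTF8, true)
str_kind(x)      # KindUTF8
isvalidated(x)   # true
```

Since these bits share the top byte with the length, setting them does not disturb the length or any of the data bytes.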
You could even pack 5 Unicode codepoints into a UInt128 by only using 21 bits for each, which would still leave you 23 bits for “metadata” about the string.
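A minimal sketch of the 21-bit packing, assuming the codepoints fill bits 0-104 and the count stays in the top 4 bits (so the 23 spare bits are 105-127, count included). pack_codepoints and unpack_codepoints are hypothetical names, and the sketch assumes the input contains only valid characters.

```julia
const CP_BITS = 21
const CP_MASK = (UInt128(1) << CP_BITS) - 1   # 0x1fffff covers U+0000 .. U+10FFFF

function pack_codepoints(s::AbstractString)
    n = length(s)                              # number of characters, not bytes
    n <= 5 || throw(ArgumentError("at most 5 codepoints fit in this layout"))
    x = UInt128(n) << 124                      # count in the top 4 bits
    for (i, c) in enumerate(s)
        x |= UInt128(UInt32(c)) << (CP_BITS * (i - 1))
    end
    return x
end

function unpack_codepoints(x::UInt128)
    n = Int(x >>> 124)
    return join(Char(UInt32((x >>> (CP_BITS * (i - 1))) & CP_MASK)) for i in 1:n)
end

x = pack_codepoints("héllo")
unpack_codepoints(x)   # "héllo"
```

The trade-off is that the fields are no longer byte-aligned, so this form needs a shift and a mask per character, unlike the 16-bit and 32-bit layouts.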