Redesigning String, optimising small strings and comparison

You should use the first byte of the pointer to signal the short form, since (at least for other uses, is an upcoming standard) it’s free (or can be):

Top Byte Ignore (TBI) is a feature of Armv8-a AArch64 that allows software to use unused pointer bits to store data without having to hide them from the hardware.

[This likely only works for loads and stores, but trivial to do for compares, or all other operations, of pointers, if/when not already done in hardware. For the GC you do not want TBI, i.e. ignore and load, rather checking, and only load when zeros.]

You can make an implementation that always uses the pointer, basically pointing to the old/current String type, and then it works as a proof-of-concept for the prefix (would still be a very useful package, just not ideal), and allows fast ordering, and all operations except for constructing the strings, and doesn’t lighten the load of the GC.

Though for ignoring the pointer, by the GC, when the top-eight bits aren’t zero, seems like a trivial change. It needs to be added to the GC, and for this string type for reusing/reinterpreting the pointer for a (part of the) prefix for the short form.

When all 16 bytes are zero then it’s the empty string, and otherwise the length of the prefix that non-zero gives the length of the string, unless the first byte is zero, and not all others, i.e. the long-form used (it would also be used for some short strings exceptionally).

There are many ways, from simplest 37% savings with 5 bits per ASCII letters (and 6 entries left over for punctuation or numbers, upper handled differently; possibly even 4 bits for only the most common letters…), to 2x with two ASCII letters compressed into a byte, using a lookup table, seemingly even 3 letters possible almost as easily in each byte, for 2.67 bits bet letter. I think fixed number of letters might be best, though needs not be, could be variable-length, i.e. BPE tokenization as in LLMs.

Compressing is a trade-off, it’s still O(1), and fast, I think fast enough, but I think still a speed-win in all cases (for real-world code), since it avoids a heap allocation and GC activity. Might be slower in some real cases if you only construct strings, and don’t do any processing on them, though likely not even then with given some GC activity.

You would need an escape hatch for an UTF-8 encoded prefix, also because you could just use that if (ASCII string) fitting 16 bytes, only compress for strings with a length of 17-32 letter etc. Even with the (ASCII) compression, I would likely have the first letter in UTF-8, to allow upper-case (alternatively a single bit to indicate it), and only the rest compressed (fully upper case isn’t common enough for 17-32 letter strings).

Despite this complexity, string equality check is just checking equality of UInt128 numbers, and e.g. isless (and startswith) is almost as simple, just shifting and some branch code, at least for the common case.

Concatenation isn’t slower assuming the first string is long enough, but even then not too bad.

But it’s not a real-world string, so I think we can just use the long form for all the problematic short strings. I do have in mind that you can use Strings for data, not text, so I might be wrong, at least using the long form would work, no less than the current design, just not as fast as the this design for all other short strings.

For the short form, you count the trailing zero bytes (or bits for compressed form; for 2x (or 3x) compression this is more difficult, then you may want to support a special case for odd number of letters).

UTF-8 can’t legally start with e.g. 0xFF, so it could let it indicate compressed strings (and indicate illegal UTF-8 with the long form). Or even just let 5 leading one bits indicate it, then left with at least 24 ASCII letters compressed. [For compressing everything into 8 bytes, this would give 11 plus letters, and only 8-16 bits taken from the prefix in the long form from the pointer, then length needs to be on the heap.]

Your prefix for long form overlapped your first letters of the data in the short form. I agree you should also have a (shorter) prefix for long form, just saying that the pointer (well its first byte) should align with the first bytes of the data in short form, because strings starting with a zero byte are rather uncommon (except for the empty string, if zero terminated, which is supported).

I do wonder if with things like compression, it’s over-complicating things a bit — at least for initial plans, it’s a cool idea.

I feel like at this point a reasonable sequence of actions is:

  1. Determine whether it could be worth renovating String
  2. Establishing a benchmark suite to evaluate implementations/options
  3. Brainstorming/implementing/testing a few layouts
  4. Taking the best + simplest idea, and seeing if that can be turned into a PR
  5. Explore potential further improvements (longer term)

I think so far this thread does (1) decently well. I’d like to proceed with (2)–(4), but I’m not sure how to best do so. Help would be brilliant.

@ScottPJones might you be interested in any parts of this?

Regarding memory layout, perhaps it would be worth trying to develop 1 word, 2 word, and 3 word string forms so we can compare the whole range of potentially reasonable options. It’s clear from the discussion so far that there’s far more than just a single “obvious” form, and likely some non-obvious tradeoffs that just need to be tested.

At this stage, I’m thinking maybe it would be worth creating a repo called something like StringExperiments to hold a collection of trial implementations + a benchmark suite.

Thoughts?

4 Likes

It would be helpful to fix the current code in Base that assumes all strings are UTF-8 encoded bytes stored contiguously in memory. For example, Switch supertype of CodeUnits from DenseVector to AbstractVector by jakobnissen · Pull Request #54002 · JuliaLang/julia · GitHub
julia/base/strings/string.jl at 0fade450a183470b01c656a9001512ef2f1aae47 · JuliaLang/julia · GitHub
`GenericIOBuffer` assumes data is stored contiguously in memory · Issue #54636 · JuliaLang/julia · GitHub

Another step would be to try and move most of StringViews.jl into Base.
It doesn’t make sense for every new experimental string type to have to reimplement iteration, parsing, regex, and searching functions to be performant.

I think this might require a new function in Base for getting an AbstractVector{UInt8} of UTF-8 encoded bytes from an AbstractString, which avoids copying in the common case where the string is already UTF-8 encoded.

That already exists:

1 Like

Nice!

But as far as I understand, this is only for identity/address-having mutable foreign types and therefore not applicable?

The thing one would want is more similar to

@fancy_type struct MyBigInt
    opaque::UInt64
end
function unwrap(someval:: MyBigInt)::Union{Int64, BigInt}
    if someval.opaque >> 63) == 1
        #this happens to be a small value
        reinterpret(Int64, someval.opaque << 1)>>1
    else
        #this is a bigint managed by some external library. The following needs to be a no-op for performance.
        unsafe_pointer_to_objref(reinterpret(Ptr{Nothing}, someval.opaque))::BigInt
    end
end

I.e. have allocation-free fast arithmetic for small integers, and dynamically promote to externally allocated memory.

Apart from knowing to use the custom mark function on MyBigInt instances – which may be inline allocated in arrays or other structs – codegen would need to understand how to properly root non-heap-allocated MyBigInt in gc frames, without boxing, right?

This could be custom registered code to unpack the type into references.

Or it could be a new stack-allocated instance (i.e. copy the payload with prepended object header onto the stack, put a pointer to the stack into the gc frame – then one could reuse the custom mark function).

PS. Add `TaggedRef{T}` / `TaggedPtr{T}` / `ImmediateOrRef{T}` · Issue #47735 · JuliaLang/julia · GitHub is a related discussion. Something like that is probably enough runtime/codegen support to make String2 possible to create in a package.

But as far as I understand, this is only for identity/address-having mutable foreign types and therefore not applicable?

Only so-far that nobody has added support for that. It should be feasible to add “immutable”, but yes then you suddenly face a codegen issue. You could perhaps still store the object inline, but unpacking it to the stack would be tricky.

The prefix seems like a lot of work for something that’s only situationally useful.

Here is a different proposed allotment in 128 bits

  • Inline: [ data (120b) | 0x00 (8b) ]
  • Heap: [ ptr (64b) | length (63b) | 0b1 (1b) ]

The inline version is defined by the presence of a null terminator in the 16th byte. To canonicalize the representation, all unused bytes are also set to null (i.e., an 11 byte string would have null in all 5 trailing bytes). The length is not stored because it is trivially computed in a few instructions (i.e., based on trailing_zeros). Length isn’t used that often and it’s so cheap to compute in this format that I’d rather have the byte.

In the heap version, we store a pointer and the length. We scavenge just one bit (or byte, if it makes a practical difference) from the last byte of the length field to tag it as a heap string, leaving 63 (or 56) usable bits for the length. A nonnegative Int64 only needs 63 bits anyway. Extracting the true length from this field is a simple bitshift to discard the tag bit. The pointer is fully intact and requires no masking or shenanigans to make it valid.

2 Likes

The problem with that is that if the pointer isn’t in the inline version too (it’s a valid design, just not the ideal final one), then you need some GC support (I believe), since it must know if you have a pointer to follow, or something else. It would never look at that last bit.

So you might as well steal from the leading bits of the pointer (assuming GC support), those bits will always be zeros, in legal pointers, for the next hew hundred years.

There’s no reason to limit yourself to 120 vs 128 bits for the letter data prefix.

The argument for the prefix is that it is really useful for ==, sort, and friends which are very commonly used string functions.

2 Likes

I’m currently working on a skeletal StringExperiments repo, as mentioned earlier.

Just let me know (here, in DMs, slack, wherever) if you’d like to get commit access to the repo and I’ll happily do so.

At this stage I’m thinking about how I can do step (2) well:

I’m going to try scraping all Julia package source code for semi-representative pool of strings (using JuliaSyntax).

Once that’s done I’ll see if that can be used to help make a nice suite of string-related benchmarks. I think this should be a mix of basic operations like ==, hash, and cmp, as well as some higher-level stuff like leftjoin! with a DataFrame.

If anybody has ideas on either the sourcing or benchmark creation, help would be brilliant :slight_smile:

Once that’s done (or even in parallel), brainstorming layouts + implementing them would be great too :grinning:

7 Likes

There are a few benchmarks at BaseBenchmarks.jl/src/string/StringBenchmarks.jl at master · JuliaCI/BaseBenchmarks.jl · GitHub

Yup, I’ll be grabbing those too

Actually your design doesn’t have the downside I mentioned, if the GC is never involved. I.e. if the string type allocates on the heap, and manages it, uses the pointer, but the GC never follows it. But that mean you, or at least the string type, must free it, implying manual-memory management (most) Julia users want to avoid, or an ownership model, or well reference counted strings. And RC isn’t too bad, and reduces GC pressure with early freeing (used for everything in Python, and Nim; and even Rust and C++?), and the downsides of RC are greatly minimized because of the short form, then not needed. But RC means the reference count needs to be in the heap, and implies locking.

Yes, and for that it needs not be large, can even be 1 byte. I.e. you test for inequality very quickly, and get a lot of bang for the buck for sorting, but not if your strings start with say “https://www.” why I’m thinking you really want at least a 13-byte prefix.

A lot of strings are never sorted, or even compared at all, just stored (and maybe later printed), so the overhead needs to be minimal.

But a larger argument for a prefix is when it fits all of your string, and then it’s a real win, no heap allocation.

I keep forgetting about the GC. I guess that would require some work for these union representations (but is required in most of the proposals here).

If two strings are already more than 15 bytes long, there’s a good chance you can decide != (which is the best a prefix can declare) by length alone. I think URLs are a poor example because they have an unusually high chance of prefix collision (unless the prefix is longer than the 8 characters "https://") and a low chance of a length collision (because they are somewhat long). UUIDs might be a better example, but aren’t those commonly stored in a non-string format?

When one is sorting strings, they’re usually short (enough to fit in the inline format).

The prefix is useful for sorting, and is (typically) stronger than length for != (although checking equality on long strings is rare), but there are a huge number of calculations where it isn’t useful. Meanwhile, one must write, maintain, and spend instructions on the code for handling it (and ensure that code is robust to all technically-legal uses).

The last byte is a null terminator '\0' (a required part of any C-compatible string and most of the proposals here). So the string itself is the full 16 bytes, it’s just there is no choice in the final byte. I “separated” that byte in my proposal to emphasize its role in discriminating inline strings from heap strings.

That’s a lot of bytes. Most of the proposals here are for 16 bytes total (because of call-passing considerations, if I understand correctly?) and that just isn’t feasible in that space (current OS’s typically require ~6 bytes for the pointer, if I recall correctly, although that number could increase to 8 in the future or in uncommon OS’s).

No, it isn’t you can use all 16 bytes for 16+ letters, for the short form (for the long for you have 8 bytes, yes, or possibly 9-10 if very clever; I meant 16 bytes of data inline, for short form, incorrectly used prefix for that, that term only really applies to the long form, and then 8 bytes discriminate (i.e. most often), and if not, then and only then do you look at the heap). Then you don’t need a pointer (assuming GC support, or managing the pointer yourself, can be done in a non-breaking way in a package, i.e. without changing Julia).

[It’s also perfectly fine to use 3 or 4 registers, for 24, and 32 bytes/letters total, as some designs for C++ do, but I don’t think we needed to, should use 2 or even maybe just one…, but that almost requires compression, some form, this or other: BPE could/would… compress e.g. https://www. to just 1 byte…]

No, that only works if you have the string on the heap, i.e. take a pointer to it (often you might construct the string in registers only and it might never at all go to memory, you could print end free it before). But you can’t take a pointer to register content, as I explained above, so Cstring needs to copy from the short form (but not the long form). C requires a pointer to the string, and yes, it may point to the stack even, but I believe it’s very dangerous to point to the stack in Julia (and C), since the stack frame might go away.

[Julia decides to escape analyse or not, i.e. to use the stack, but I’m not sure, I guess the LLVM desides to even eliminate from the stack, to put into registers. It means you can’t rely in point to the stack even (or the heap), and I’m unsure how C or C++ handle this, maybe some optimizations are simply ruled out, but allowed in Julia?]

I believe Cstring is mainly for file paths, and they are almost always longer than the prefix budget. In theory it’s for calling arbitrary C code (why would you call C string handling code when you can do everything as fast, even faster in C? Note, C++ doesn’t use just C strings), but I’m not sure it’s used that often, at least by Julia’s standard library never except file paths?

Right, but I’m not sure even == and != are that important (still well-served by the prefix, especially startswith, which is very important). isless is I think most important, and then knowing the length isn’t helpful.

The prefix is useful for sorting, and is (typically) stronger than length for != (although checking equality on long strings is rare), but there are a huge number of calculations where it isn’t useful.

I think Julia needs a borrow-checker eventually, and I’m thinking is there some way to allow owned strings and then needing a different type to pass strings up the call stack?

In the mean-time, or even better than Rust’s ownership model are generational reference from Vale:

Safe: It is the safest native language, thanks to generational references and Fearless FFI.

https://verdagon.dev/blog/generational-references

Now, instead of malloc returning a void*, let’s make a gmalloc that returns a “generational reference”.

Generational References are Fast

The very first versions of Vale used the above approach for its memory safety. We benchmarked it, and it outperformed reference counting pretty nicely, even with regions turned off.

Generational references have only 10.84% overhead, less than half the cost of reference counting!

It seems generational references can be made in a package (i.e. gmalloc, that calls malloc for you, and handle’s you freeand eliminates the (otherwise, potential)double-free issue), could be a separate topic/thread on that, or could be explored in the string type package, then split off when it works, into a separate package for more general use.

My counterproposal would be:

* Inline: [ data (127 or 128b) ]  # you don't need to steal one bit to indicate this for, but yes, that's one option.
* Heap:   [ ptr (64b) | prefix  56-64 b | length (8-14b) ]

Inline covers 0-16-string lengths. The length with only 8 bits for the long form, for the heap, only covers up to 255 bytes, seemingly, but actually 17 16 to 272 bytes/letters (e.g. all file paths, except those infrequent covered by the short form). 255 would be used to indicate the length is stored on the heap.

Alternatively: At a minimum you want 1-2 bytes prefix in the long form too, so, you can have (56- or) 48-bit lengths, if you prefer, for 256 TB strings max… The pro of this form is never storing the larger lengths on the heap…, and easier to take over longest strings (from C). Can still be done with double indirection, two heap allocations then.

Most strings are small, covered by inline, and if not, then most are still smallish, under 255 bytes, but if you don’t trust me, then you can increase length up to say 14 bits, for 16384-length strings, and those extra bits can be taken of the pointer and/or the prefix, your call. Having quickest access to non-heap-stored lengths isn’t actually that important.

With 14-bit length option, the prefix is down to only 7 or 8 letters, or 2x that (for at least 14 letters, already rather good) that or 3x with compression.

I’ll do more with this later, but I’ve come across a page that seems to nicely summarise + microbench a few different pointer tagging methods we might want to consider: What is the best pointer tagging method? • Core Dumped

The low-bit method in particular interests me.

Should we be interested in fiddling with the top bits at all, I wonder if we need to worry about Intel 5-level paging - Wikipedia though?

we shouldn’t use the top bits. 5 level paging is increasingly getting adopted and it seems possible that ARM will continue to innovate on pointer tagging where the top bits of the pointer become a hash of the base pointer address (so out of bounds accesses can reliably seg fault)

1 Like

5 level paging does give me pause, but some further investigation shows the current situation:

Architecture Feature name Availible top bits
ARMv8 Top Byte Ignore (TBI) 8 bits
AMD x86 Upper Address Ignore (UAI) 7 bits
Intel x86 Linear Address Masking (LAM) 6 bits, or 15 bits*

*depending on whether or not 5-level paging is enabled

This makes me think that it should be safe to use some top bits, no?

1 Like

All the top bits will be zeros (or at least with current libc, and the hardware can’t change that) in all memory allocated from malloc(?!). I’m unclear what you mean by the hash, it seems like a non-default malloc addition, that they want to support, and may promote into malloc, but wouldn’t be by default?

Which means we can mess with them, at least when avoiding GC, i.e. doing our own free. And as I explained we can, and I think should to it that way, not needing any change to Julia (for the string package, until merged into Base that is, then of course it’s a simple change to Julia).

Even if this hash gets adopted later, then if seems to be redundant info, and we could drop it and put in our own?

6 bits, or 15 bits*

Even if we can only rely on top 6 bits zeros, not 8, then that’s ok, only the strings with 0x01, 0x02, 0x03 will be affected/ambiguous and can be worked with, and they aren’t even commonly used.

If we can steal 15 bits from the pointer, then it still allows for 512 TB of memory allocations.