Redesigning String, optimising small strings and comparison

As I understand it, all zeros is expected for userspace memory unless TBI/UAI/LAM is applied, and then (Linux) kernel-space uses all-ones.

My takeaway is that we can reasonablly just steal 6b, which if it was all used for length would give us 256 GiB (without encroaching on the prefix). I’m beginning to wonder if it would be worth exploring 3-word options…

Stealing 15 bits gives 512 TB (total) memory allocations, and double for each less bit, so I’m not sure where you got 256 GB, you calculated 256 petabytes (1025*1024 times more) and made a mistake…?

julia> Base.format_bytes(2^(32+6))
"256.000 GiB"

The length of the string in the long form could be part of the memory the pointer is pointing to. The first 8 byte of the data could be the length for example.

This way, ~4 GiB would not be the limit. The sorting could maybe even be a bit faster as the prefix could be larger.

As this would only use an 8 byte length for strings larger than 16 code units (depending how you count), it would be guaranteed to always waste ≤50% of memory (length + pointer).

If representing large strings is an issue, maybe we can take a cue from LZO and use the first byte of the length register to signal how many bytes are needed to describe the full length of the string? For example, if the layout of the 128 bits is:
[ length + prefix (64b) | ptr or data (64b) ]
Then you have the flexibility to store up to 15 bytes inline by using one length byte and 15 data bytes:
Strings of length 0-15: [ 0b0000 | length (4b) | data (15B) ]
The code can detect if the string is stored inline by checking the high four bits of length.
On the other end of the spectrum, if the high bit of length is set, you can treat the remaining 63 bits of length + prefix as the length, then follow the 64 bit pointer at the end of the struct to the heap for the data:
Strings up to length 2^63-1: [ 0b1 | length (63b) | ptr (64b) ]

The three starting bit codes between these two extremes (0b0001, 0b001, and 0b01) can be tuned depending on how many bytes of prefix string you want to store on the stack versus how long the full string is on the heap. For example, the code 0b0001 could signal that the low four bits of the first byte plus the next byte constitute a 12-bit string length, leaving you with six bytes of prefix and 64 bits of pointer, like so:
[ 0b0001 | length (12b) | prefix (48b) | ptr (64b) ]

The code 0b001 could signal that the low five bits of the first byte plus the next two bytes constitute a 21-bit string length, leaving you with five prefix bytes:
[ 0b001 | length (21b) | prefix (40b) | ptr (64b) ]
… and so on.

1 Like

There’s no need to have length stored for the short form, in a separate byte.

Legal UTF-8 can’t start with five 1 bits, so that can signal whatever you want.

This is overkill. You don’t need strings (for text, or data), basically files of that size. 56-bit length gives you 64 PB max. But I like your variable-length length proposal, probably some three levels are enough.

While you want the string type to eventually not use mutable struct, since they live on the heap (I mean the struct it self, why you can get a pointer to it, plus the other heap allocation for heap_string), though I (also) try it for an experiment:

julia> struct mystring
         prefix::UInt64
         heap_string::String
       end

julia> mutable struct mystring2
         prefix::UInt64
         heap_string::String
       end

julia> reinterpret(UInt64, str2.heap_string)  # you have access to prefix, it's isbbits but not the rest of the struct or it as a whole
ERROR: ArgumentError: Source type for `reinterpret` must be isbits

julia> str = mystring2(0, "Palli"); p = pointer_from_objref(str)
Ptr{Nothing} @0x00007f0eea022630

julia> p2 = pointer_from_objref(mystring(0, "Palli"))
ERROR: pointer_from_objref cannot be used on immutable objects

for good reasons, but I can bypass (not in general):

julia> ccall(:jl_value_ptr, Ptr{Cvoid}, (Any,), mystring(0, "Palli"))
Ptr{Nothing} @0x00007f0ecba97e90

julia> p[1]  # EDIT: solved below, was: I'm not sure how to deref the pointer, but by fixing I should be about to access the prefix, or its bytes, and the pointer with something similar to p[2]?
ERROR: MethodError: no method matching getindex(::Ptr{Nothing}, ::Int64)
julia> unsafe_store!(reinterpret(Ptr{UInt64}, p), 123)
Ptr{UInt64} @0x00007fb02b427c70

julia> str  # so far so good I can mess with the prefix:
mystring2(0x000000000000007b, "Palli")

julia> unsafe_store!(reinterpret(Ptr{UInt64}, p+8), typemax(UInt64))
Ptr{UInt64} @0x00007fb02b427c78

julia> str  # predictably, and what I actually aimed for:
mystring2(0x000000000000007b, 
[1200858] signal (11.1): Segmentation fault
...

Another way to crash it sooner:
julia> GC.gc()

[1201791] signal (11.1): Segmentation fault
gc_mark_obj8 at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gc.c:1862
gc_mark_outrefs at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gc.c:2641 [inlined]
..

[That file is gone in 1.12, so I’ll look into “fixing”/make work in 1.12, and it may never work for 1.10 LTS too. Or 1.11, though simple to backport to either I guess, not a priority to look into, I think the function simply moved.]

In 1.12 it’s the same (or similar looking) gc_mark_obj8 function, just moved to a different file, but actually in a line a bit lower:

[1204152] signal 11 (1): Segmentation fault
in expression starting at REPL[4]:1
gc_mark_obj8 at /cache/build/builder-amdci4-5/julialang/julia-master/src/gc-stock.c:1566 [inlined]
gc_mark_outrefs at /cache/build/builder-amdci4-5/julialang/julia-master/src/gc-stock.c:2327 [inlined]

I want to mess with the pointer, but also reliably for an immutable struct, to get a segfault, so I can fix the GC to ignore illegal pointers.

Unlike C strings, Julia strings can contain NUL bytes so me do need an explicit length.

They can also hold invalid UTF-8 data.

4 Likes

It would be a valid design choice for us to not be able store invalid unicode strings inline. They are pretty rare so if that makes everything else faster, it’s probably worthwhile.

2 Likes

Valid design choice, sure, but also quite breaking. It’s useful to be able to read & handle that kind of data on ingress.

How would it be breaking? You’d still be able to read that data, it wouldn’t use the inline string representation.

Apologies, I missed the “inline” bit; I thought you meant not storing invalid UTF-8 entirely.

Still, if the value for inline strings is being able to store them on the stack/in a struct just because they’re short, it’d be quite unfortunate to cause an allocation depending on the data you want to put there. That would kind of defeat the purpose of the optimization, no?

I think there are a lot of interesting design tradeoff questions that come down to “is it better given the mix of strings that we actually see in practice?” where the answer comes from a collection of benchmarks.

I think at this stage we’d be very well served by a nice suite of benchmarks to compare potential implementations with. I’ve currently got a nice string corpus from literal strings in registered julia packages (repo should be up sometime this week), but I’d like to work out more that can be done with them.

So, I’m considering making another thread (maybe in #GeneralUsage?), asking for community help putting together a nice collection of string-oriented tasks.

6 Likes

No, we only need implicit length, and we can, and I think should, actually go for 8 bytes only, i.e. store everything in the pointer.

  • Inline: [ data (up to 64b) | 0x00 (8b) terminator, except when all 8 bytes used ]
  • Heap: [ 1 0 | length (7b) | prefix (8 bits) | ptr (47b) ]
  • Heap: [ 1 1 | length (31b) | handle (31b) ]

This allows pointers to address 128 TB, i.e. a lot of room for strings up to 128 bytes in length each. Or probably up to 128+16 = 144 bytes.

The last form allows up to 2 GB strings and 2 billion such strings in memory at the same time.

You can actually use the middle form for longer strings and/or up to 128 TB strings (for one or all in total at least), but then you need length = 127 to be special-cased to mean the length stored on the heap.

This allows all short ASCII strings inline, up to 15 letters, or 16 if we want to sacrifice the null-terminator (I still think we should, at least in the full implementation, but then it’s a bit more complex to know which form of string you have).

And that’s perfectly fine as I (and Oscar) explained, they don’t all need to fit in the inline form.

[Outdated: I meant to simply answer, but it seems I accidentally overwrote my previous answer, probably my fault, not Discourse error. I’m not sure I can recover or see my old answer. Now it’s done.]

@Palli I’m not sure which post you overwrote, but you should be able to recover your old post by clicking on the pencil symbol at the top of the post. It will show your edit history!

1 Like

This is 9 bytes to my accounting?

Also, how do you imagine handles to work here? That is, how would handles be accessed and allocated?

Does your plan include an additional level of indirection on access, or is your goal to have a special allocation class for strings in the third category (i.e. reserve a contiguous 1<<31 bytes of virtual memory on startup, and its more pointer compression than “handle” style)?

Right! I made a miscalculation in allocating the bits, so I simply edited my post and dropped the prefix in that third format! It’s a nice feature to have, the alternative would be to take some bits off the length and/or handle, and even leave just 4 bits of prefix could actually be helpful, though ideally it would be 14 bits+.

  • Inline: [ data (up to 64b) | 0x00 (8b) ] # terminator, except when all 8 bytes used. It can be even just down to 4 zero bits in an alternative proposal.
  • Heap: [ 0 | length (7b) | prefix (16 bits) | ptr (40b) ] # first 8 bits of the pointer must be 0, in fact any consecutive 8 can be zero… except last eight is not enough to signal this format.
  • Heap: [ 1 | LEB-encoded length (up to 33b) | ptr (min. 31b) ] # some of the 64 bits must include 8 zeros in a row, just not only last 8.

Memory must be manually managed. The middle format can likely be eliminated, redundant with LEB, while still keeping (some) prefix in the third format, and having useful (max) lengths for the fields.

My first idea was that the pointer could be variable-length from say 32-bits up to 64-8 bits, and the 0 terminator would come in the middle to know where the prefix ends and pointer begins, but the miscalcuation came when also in adding the length.

That plan would also work, and then you have at least 4 GB of space for strings, in total for all of them for the shortest pointer length. I actually though of reserving the bottom 4 GB of virtual address space too, to not have to share it with other data, but it’s actually not needed if the pointer is variable-length, for the proposal when not with just 2-3 possible fixed lengths.

My idea with handles, yes, for the longest strings was two indirections, so that I can always include long lengths, but they can be variable-length too, e.g. LEB integer encoding. So you can actually get away with just regular malloc, and no extra indirection, just pointers never handles.

(Unsigned) LEB or any similar appropriate:

This thread now exists in General Usage :slight_smile:

2 Likes

I think you meant BigFloat, BigInt is already not based on Strings, and while BigFloat currently is, that’s going away with this PR (with a “merge me” status):

mutable struct BigFloat <: AbstractFloat
..
    # _d::Buffer{Limb} # Julia gc handle for memory @ d
    _d::String # Julia gc handle for memory @ d (optimized)

to:

struct BigFloat <: AbstractFloat
  d::Memory{Limb}

[That said, I would want both types out of default Julia, for Julia 2.0 (just still available in an upgradable package), at least one of them, it’s not even the fastest type, with better alternative in a package, so people shouldn’t always rely on Base, it might lead them down a slower path.]

I.e. I think we can safely concentrate on just regular text strings, ASCII only for the short form (and/or legal UTF-8), and while we must also (still) handle illegal UTF-8 too, in some way, i.e. arbitrary byte data, it doesn’t need to be handled in the short form, or be as optimized.

FYI: That PR is now outdated since related to BigFloat/MFPR, but I’m not sure why it wasn’t merged in 2021, if it was a good optimization for it, likely short BigFloat (and BigInt not as important)? I didn’t take a good look at it, but note it wasn’t too localized, also changed gc.c and julia.h.

This sounds very interesting. I’ve had the same experience that string work in Julia - unlike numerical work - tends to be slower than Rust. This is not due to the memory layout I think, but having a hybrid inline/heap allocated string type may be a good idea.

What happens if a user calls pointer on a string which is in the short form? I suppose it would need to be heap-allocated, which implies that long-form strings can also have length < 14 which will complicate various implementations. This also makes pointer an “expensive” operation for strings, since it needs to check the string type and potentially do an allocation - which again, will affect stuff like inlining.

Overall, I’m a little skeptical that it’s a good idea to change the generic String type to be (much) more complex. Most benchmarks might speed up, but some will slow down, and then it’ll be frustrating that the slowdown is due to a complex memory layout you don’t need or want. Also, it means that some seemingly complex operations like “get a pointer to this string” or even “get the length of this string” will be surprisingly complicated, which is frustrating in itself, and also may make optimisation of user string code harder.

One solution is to make a GermanStrings package. This might still be the best idea. However, this comes with two big problems:

  1. Few people are going to use the package. It’s much better to make String faster, as it’s going to have 100x more users.
  2. It’s going to be very, very hard to make a new subtype of AbstractString and be anywhere near confident that all optimised String methods also have an optimised equivalent for your new type. Practically speaking, unless that issue is addressed specifically and with a plan, it’ll end up being an unsystematic mass of fast methods with lots of performance holes that people will fall in. And then users will find that using your type often makes their code slower.

For what it’s worth, my experience with Julia’s string performance has mostly been that

  1. It’s difficult to avoid allocating tonnes of small strings in intermediate steps
  2. SubString sometimes hits slow, generic code paths compared to String, or will hit paths that cause them to be converted to String
  3. I almost always end up reaching for StringViews to avoid needless allocation, which also tend to hit slower code paths than String

The issue of hitting slow fallbacks is more encouraging, IMO, as it’s “just” that a lot of work, and maybe some more rigorous internal interfaces to allow fast new AbstractStrings that is needed.

Not to be a party pooper, but given the above, I believe to make strings faster in Julia, a necessary first step might be to take a look at the internal string interfaces and dispatch patterns, and to make sure that Base has the necessary abstractions to allows users to make new AbstractString types reliably fast.

Edit:
To brainstorm some more memory layouts:

  • It has previously been discussed to store whether the string is ASCII in a bit of the string data. This also complicates implementations, and will make string creation slower, but many subsequent operations are faster on ASCII strings, and most strings are ASCII. Most operations that create new strings from existing ones can also easily compute if the new string is ASCII, so ideally, the cost of checking every byte really is paid only once.
  • I wonder if storing the prefix might be a good idea for all strings, even if it’s decided that a heterogenous string layout won’t work. Of course, this will just make strings larger without the ability to skip the heap allocation, so I’m not certain.
4 Likes