Redesigning String, optimising small strings and comparison

tecosaur · October 1, 2024, 9:32am

To me this is all the more reason to have a good set of benchmarks: so we can identify regressions at this stage instead of having an “oops” moment later.

Yea, this has been pointed out, and I’m interested to see what the impact may be. Separately, there are some suggestions that have been raised on how this could be avoided. (also: I think you may have meant “seemingly simple”)

Might you have any nice pointer-heavy examples of String usage?

This does very much sound like a big deal for Julia’s string usage too, it was briefly mentioned in the most recent Triage too (we briefly discussed this topic). I’m half tempted to respond with por qué no los dos? though .

This is one of the things I want to experiment with too, glad to see it’s got legs enough that it’s at least occurred to someone else

This is an approach I’d like to investigate if it turns out that having a fast pointer method is vital and needs to be kept simple. I suspect it could be worth us doing something like this regardless of the small string optimisation.

Thanks for the thoughts and brainstorming

Palli · October 1, 2024, 10:00am

I now propose 7 byte strings, because of a problem I realized:

julia> struct problem
         b::Bool  # The ordering doesn't matter, Julia and other languages will pad anyway
         my_string::String
       end

julia> sizeof(problem)
16

Sizes of structs are important, padding crowds out caches, L1 and up, see data-oriented programming.

You might expect that a Bool takes 1 byte (or even 1 bit), but it takes 8 bytes, while it can take only 1 byte with:

julia> struct my_7_string  # will be reinterpreted as a pointer for the long form, otherwise be data, it's getting rather small for a prefix and length and a pointer, but could work.
         _1::UInt8
         _2::UInt8
         _3::UInt8
         _4::UInt8
         _5::UInt8
         _6::UInt8
         _7::UInt8
       end

julia> struct no_problem
         b::Bool
         my_string::my_7_string
       end

julia> sizeof(no_problem)
8

It gets even worse with:

julia> struct large
         a::UInt64
         b::Ptr{UInt8}
       end

julia> struct problem2
         b::Bool
         my_string::large
       end

julia> sizeof(problem8)
24

That was why I no longer aim for fitting in two registers, was going for one and now 7/8th of one.

This is actually a potential problem for all pointers in Julia, and they should arguably all be changed to 7-byte ones. Also for [U]Int64, it’s also overkill, in most cases. I’ve been thinking for a long time that Int32 should be Julia’s default (then needing overflow checks, but a 7-byte compromize can be a good alternative.

How low can you go? Reminds me of the classic:

The problem I see with a new string type, the problem2 struct above, are all the non-string operations, i.e. uses of you struct that do not even target you string, but as I explained that worry can be eliminated.

You might think that two registers is doubly slower than current strings that target only one register, but for all strings that live on the heap, it’s a memory saving, and thus faster (also for other reasons). And when inline all the operations are O(1), thus I don’t see it as other than win-win.

foobar_lv2 · October 1, 2024, 10:27am

I like that a lot! Might also want to store whether the string is valid utf8, and whether its valid C-string (i.e. null-terminated without interior null-bytes). Might also want to store a (truncated) hash value.

We can steal upper bits from the length of that, and fold the computation into the relevant memcpy calls. If we take another 2-4 bytes, then we can store a complete hash value (steal some upper bits from length plus 2 extra bytes; realistically 48 bits of hash is enough entropy, and 48 bits of length is enough). That of course stands and falls with the extra cost of computing this in the runtime. Presumably one would need clever SIMD code to make that fast enough.

This is a potential problem for everything and is the same in languages like C or java. It’s called alignment and structure padding, and it is up to the person who types struct to think about memory layout. A person who writes struct BadAlign a::Bool; b::Int64 end is responsible for their fate (and should consider struct BadAlign a::Bool; pad::NTuple{7,UInt8}; b::Int64 end).

It is to julia’s great credit that it produces C-compatible structure layouts instead of allowing the runtime to pretend to know better (as opposed to e.g. java – the JVM will change order of fields to minimize structure padding, marking a major pain for both C interop and e.g. separating two fields into two different cache-lines if both need to be concurrently accessed by different threads).

If you go for unaligned 7 byte object references, then references can span cache-line or even page boundaries, which will fuck up performance.

Palli · October 1, 2024, 10:37am

Yes, and no, it’s just that the user can’t “think” about this. You’re limited by the largest type in the structs (e.g. the 8-byte pointer in String) you add into your own struct. You don’t have the power to opt into 7-byte pointers, only this new string type, that would likely end up in Base, and then you benefit.

I’m a bit worried about the need for access to the pointer needing to be atomic, and I think it will not if it straddles cache lines, why 8-byte aligned is used by 64-bit pointers and [U]Int64.

Anyway I think we should aim for 8-byte string type max, if 7-byte not possible, because otherwise a potential problem.

Why did others not do that, use 8-byte structs max, for their “German strings”? Because 16-byte or even larger is still a win, I just believe 8-byte can be an ever greater win.

It only really does if it straddles page boundaries (and two 64-bit can also be on separate pages or cache lines), and I’m not even worried about that, just about correctness (atomic access, is that needed for reads/writes? Strings are immutable). You can atomically load all 7 bytes, in all cases, except I think if you straddle cache boundary. But is that even important? For two 64-bit registers, you can’t load them either? There’s no 128-bit alignment for those, so they can be on separate cache lines. 64-bit ARM has paired loads, but I still think they don’t need 128-bit alignment (not sure).

This document summarises some known mappings of C/C++11 atomic operations to x86, PowerPC, ARMv7, ARMv8, and Itanium instruction sequences.

tecosaur · October 1, 2024, 10:42am

Not to be a stuck record, but… benchmarks

Seeing is believing and all that.

p.s. C++'s string type is three words ~= 24B, just for reference

Palli · October 1, 2024, 11:33am

Well yes, which means you can’t have atomic access to all 24 bytes at once, so I think that perceived need of mine might be overblown. Or is there any other reason to fear misaligned?

C of course has simple (64-bit) pointers for their C strings, also used in C++. Plus C++ has:

std::string, LPCSTR, System::String, TCHAR

and more, one of the reason for additional bytes is to store the length (necessary since not zero-terminated), but also capacity, since mutable strings used (only helpful if and when strings actually mutated). Why you get 3x for the struct cs C, still 8-byte alignment, but I think the argument for 24+ isn’t that it’s better than 8 or 7 in structs, just better then even larger. People are already going for just 16 bytes in C++, i.e. consider going to smaller a win, so I’m thinking where is the limit to that.

FYI: You can actually have atomic access for 128 bits on recent CPUs:

Though open issue:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

Palli · October 1, 2024, 11:52am

Most pointers fit 32 bits actually. And most need not be byte aligned (in Julia at least, for strings or other things…).

So might this help (with alignment):

struct my_7_string
  _4_byte_pointer::UInt32  # (512-byte aligned)
  _1::UInt8
  _2::UInt8
  _3::UInt8
end

This allows addressing 2 TB of string data, total (and more with clever hacks and/or indirection). The alignment will be 4-byte not 8-byte, a good thing (nor 1-byte as in previous suggestion), for all using this type. If you need byte-aligned pointers, then an offset from the pointer can be hacked into the other three bytes… Though it’s probably best to handle Cstring differently, or taking over such data, with indirection, it’s also rare that to need byte-aligned in practice, and for most Cstrings the fast-path would just work.

julia> struct str
         b::Bool
         s::my_7_string
       end

julia> sizeof(str)  # Still 25% saving vs current Strings
12

When the order in the str struct is reversed you get the ideal 50% savings possible, i.e. when no additional byte for the Bool, no padding.

nhz2 · October 1, 2024, 1:33pm

I’m not sure if there can be a pointer method for strings that are in the short form, because what prevents the newly heap-allocated string from getting garbage collected as soon as pointer returns? Base.cconvert(Ptr{UInt8}, x) would be needed I think.

Palli · October 1, 2024, 3:56pm

This is first of, not very useful for short (or long) strings. If you only have a pointer to the string, not full string type, then it implies a Cstring, and you would do that, explicitly, and yes it would need to allocate (but not for the long form). [We still want any use of pointer to work, would it be considered breaking otherwise, but is it used much? It doesn’t work on strings with zeros in the middle(?!).]

julia> println(unsafe_string(pointer("Palli\0Palli")))  # bug
Palli

julia> Base.unsafe_convert(Cstring, "Palli\0Palli")  # safe[er...]
ERROR: ArgumentError: embedded NULs are not allowed in C strings: "Palli\0Palli"

The former is a bit faster (or a lot faster with long strings), when you know or assume no NULs in the middle.

Note:

help?> pointer
.. This function is "unsafe". .. The GC.@preserve macro should be used to protect the array argument from garbage collection within a given block of code.

Calling Ref(array[, index]) is generally preferable to this function as it guarantees validity.

The pointer for long (or short) form will probably not be GC-tracked, for that to work the pointer needs to be in the long-form AND the short form (since it’s a C-style union), and that needs modification to the GC, unless bitstealing is never used on the pointer, meaning other trade-offs.

I can see that we want the pointer is only available in the long form, then you still need (a trivial) change to the GC, or another option is that the GC doesn’t even track the long form.

mikmoore · October 1, 2024, 10:34pm

I did some experiments with stuff like this

struct InlineString <: AbstractString
	dat::NTuple{14,UInt8}
	len::Int8
end
struct HeapString <: AbstractString
	ptr::Ptr{UInt8}
	len::Int32
	# only 12B before padding
	# we could use more but that's not the focus here
end
struct UnionString <: AbstractString
	str::Union{InlineString, HeapString}
end

sizeof(InlineString), sizeof(HeapString), sizeof(UnionString)
# (15, 16, 24)

The idea was to use Julia’s normal Union to avoid (?) GC complications. It costs 1 byte to do so (for the tag). However, this HeapString is padded from 12 up to 16 bytes and then the UnionString is padded from 16 to 24 bytes. If the Union resolved sizing before tail padding then this could have fit in 16 (i.e., a Union of two types with sizeof<16 can fit in 16B). Changing struct-embedded Union to recycle tail padding for the tag byte would enable this UnionString sort of thing to work in the intended size, but that’s a big (and maybe difficult) feature that may not be of much value elsewhere.

Palli · October 2, 2024, 2:07pm

I managed the 1-byte overhead for Unions this way:

struct InlineString <: AbstractString
  data::NTuple{7,UInt8}
end

struct HeapString <: AbstractString
  # len1::UInt8, it doesn't matter on which side this is....
  ptr::NTuple{6, UInt8}  # to emulate 6-byte-pointer; UInt32 also possible for Ptr32{UInt8}
  len1::UInt8
end

struct UnionString <: AbstractString
  str::Union{InlineString, HeapString}
end

julia> sizeof(InlineString), sizeof(HeapString), sizeof(UnionString)
(7, 7, 8)

NTuple (there in HeapString) isn’t 8-byte aligned, since it’s based on UInt8, but I think something like NTuple could still be made to be 8- or 4-byte aligned, despite that. I.e. a pointer type that isn’t full 8 bytes could be made and would be very useful, there addressing 256 TB, or way more if not byte-addressable.

You could just skip the length for HeapString, rather use it for a prefix, and/or shorten the pointer more, but (if len kept then 255, would signal) longer lengths will be kept on the heap, and it doesn’t need an extra indirection, but that’s one possibility to do it.

Also len isn’t needed for the InlineString, so you can get 7 data bytes, and even, since you do not manage the full 8 you want, get it with a huge hack…, exploit the Union, since it’s 8-bits:

struct UnionString <: AbstractString
  str::Union{HeapString, InlineString_starts_with_A, InlineString_starts_with_B, ... etc.}
end

Ideally a Union would only have 1-bit overhead, when that is possible…, i.e. for just two cases. Then you could use something like bitstypes that are not power of two.

Topic		Replies	Views
Community string benchmark suite General Usage	3	153	October 1, 2024
String optimisation in Julia General Usage performance , strings , io	21	564	September 21, 2024
Is it possible to see performant string processing without sacrificing ease of use in Julia one day? Performance question , strings	7	788	May 28, 2022
Improved allocation design, with 4-byte pointers, and sometimes 5-byte in effect Internals & Design	10	740	December 12, 2024
Performance of length(::String) Performance	24	3885	July 28, 2018

Redesigning String, optimising small strings and comparison

Related topics