Redesigning String, optimising small strings and comparison

Whenever I’ve had to do large-scale string work in Julia, I tend to be less impressed with performance than when doing numerical work. This has produced a niggle, and that niggle as lead to some investigation, which has in turn resulted in me prototyping a rewrite of the String type.

Performance improvements (compared to the current String)

  • ==: 1x to 2.8x faster
  • hash: 1x to 2.1x faster
  • cmp: 0.95x to 1.9x faster
  • startswith: 1.3x to 2.2x faster
  • dict getindex: 1.3x faster

Now that I (hopefully) have your attention, here’s what’s going on.

Background

This is a project that I started on Slack a few months ago, and since that conversation has fallen into the slack-hole I thought I should record this work somewhere a little more permanent. Here it is :slightly_smiling_face:.

For starters, it’s worth realising that:

  • All Strings in Julia take a minimum of 4-words/32-bytes
  • Everything other than length involves a pointer lookup

The memory layout looks a bit like this (I had a better diagram on Slack, but it’s part of the losses — this might be a bit off but it’s got all the right bits):

rect1

There are a few alternative string forms that I’ve become aware of, optimising for space and efficiency of string comparisons. I’d like to draw your attention to:

I’ve ended up with an idea that seems fun (I think) that’s a bit of a hybrid of these, based around a base size of 2-words/16-bytes:

rest2

Regardless of whether we have a short or long-form string, we can look at the first 4 bytes of the string without any pointer lookups, as well as the length. This makes comparing the first word/8-bytes very informative. The downside is we limit the size of a single string to ~4 GiB, which seems like it could be an acceptable compromise. By storing the not length in the short form, every length from zero (0xff) to fourteen (0xf1) is non-zero, and so works as a tag to differentiate between pointers (which malloc aligns to the nearest 8 or 16 bytes, leaving the last byte available for an informative tag :wink:, and if need be we could make do with 4 bits instead of a whole byte).

The complexity of the dual-format structure adds some overhead, but you can do a whole bunch of basic operations in the same time as an L1 cache lookup, which made me think this would be worth actually trying.

Unlike the German strings used in databases, we duplicate the prefix in the data section, making both the short and long representation contain complete null-terminated (C-compatible) string forms.

All in all, this strikes me as potentially a rather nice compromise, potentially improving both the footprint and performance of String-intensive tasks in Julia.

Implementation

This string type essentially takes two forms

struct InlineStringForm
    data::NTuple{14, UInt8}
    _null::UInt8
    nlen::UInt8
end

struct HeapStringForm
    prefix::NTuple{4, UInt8}
    len::UInt32
    ptr::Ptr{UInt8}
end

Because of the overhead of doing reinterpreting + union types, I’ve instead opted for a primative 16-byte type with logic to determine how it should be treated. This type is somewhat optimistically called String2.

primitive type String2 <: AbstractString 128 end

We can tell if this is an inline string or not by checking the last byte of the String2.

const INLINE_LENGTH_MASK = UInt128(0xff) << 120

isinline(s::UInt128) = !iszero(s & INLINE_LENGTH_MASK)
isinline(s::String2) = isinline(reinterpret(UInt128, s))

Without fiddling with the C side of Julia, the garbage collector has no way of knowing when there’s heap-allocated data to collect. That said, to work with this type for prototyping and benchmarking here’s a constructor that takes a current String:

String2/String conversion + deallocation function
function String2(s::String)
    if ncodeunits(s) <= 14
        data = Memory{UInt8}(undef, 15)
        fill!(data, 0x00)
        unsafe_copyto!(pointer(data), pointer(s), ncodeunits(s))
        reinterpret(String2, (NTuple{15, UInt8}(data)..., ~(length(s) % UInt8)))
    else
        prefix = Memory{UInt8}(undef, 4)
        unsafe_copyto!(pointer(prefix), pointer(s), 4)
        dataref = Libc.malloc(ncodeunits(s) + 1)
        Libc.memcpy(dataref, pointer(s), ncodeunits(s) + 1)
        ss_tuple = (NTuple{4, UInt8}(prefix)..., UInt32(ncodeunits(s)), UInt64(dataref))
        ss = reinterpret(String2, ss_tuple)
        # finalizer(deallocate, ss) # Can't attach finalizers to immutable types
        ss
    end
end

function deallocate(s::String2) # Can't call automatically 🥲
   isinline(s) && return
    _, ptr = reinterpret(Tuple{UInt64, UInt64}, s)
    Libc.free(ptr)
end

function Base.String(s::String2)
    if isinline(s)
        data..., _, nlen = reinterpret(NTuple{16, UInt8}, s)
        String(collect(data[1:~nlen]))
    else
        _, ptr = reinterpret(Tuple{UInt64, UInt64}, s)
        unsafe_string(Ptr{UInt8}(ptr))
    end
end

There’s a bit more boring work to implement the AbstractString interface just so we can see the string in the REPL, etc.

AbstractString interface
function Base.ncodeunits(s::String2)
    u = reinterpret(UInt128, s)
    if isinline(s)
        ~(u & INLINE_LENGTH_MASK) >> 120 % Int
    else
        (u1 % Int64) >> 32 % Int
    end
end

function Base.codeunit(s::String2, i::Int)
    if isinline(s)
        data..., _, nlen = reinterpret(NTuple{16, UInt8}, s)
        @boundscheck i <= ~nlen || throw(BoundsError(s, i))
        data[i]
    else
        _, len, ptr = reinterpret(Tuple{UInt32, UInt32, UInt64}, s)
        @boundscheck i <= len || throw(BoundsError(s, i))
        unsafe_load(Ptr{UInt8}(ptr), i)
    end
end

Base.codeunit(::String2) = UInt8

Base.iterate(s::String2, i::Int=1) = iterate(String(s), i)

Base.isvalid(s::String2) = Base.isvalid(String, codeunits(s))
Base.isvalid(s::String2, i::Int) = checkbounds(Bool, s, i) && thisind(s, i) == i
Base.thisind(s::String2, i::Int) = thisind(String(s), i)

Now we get onto the fun stuff, implementing optimised performance-critical basic string methods. Checking Base.infer_effects was wildly pessimistic, so after a read of Base.@assume_effects I think that :total may be appropriate here (:crossed_fingers:). One or two methods probably need to be downgraded to just :foldable, but nothing’s segfaulted so far.

Implementing equality
Base.@assume_effects :total function Base.:(==)(s1::String2, s2::String2)
    u1, u2 = reinterpret(UInt128, s1), reinterpret(UInt128, s2) # For efficiency
    s1a, s2a = u1 % UInt64, u2 % UInt64 # Prefix1-8 or Prefix1-4 + Length
    s1a == s2a || return false # ∵ Different prefix and/or length
    s1b, s2b = u1 >> 64 % UInt64, u2 >> 64 % UInt64 # Prefix9-14 + Length or Ptr
    s1b == s2b && return true # ∵ Completely identical
    if isinline(s1) || isinline(s2)
        false # ∵ We've checked the entire inline form, if either are inline, they're not equal
    else
        len = s1a >> 32 # `s1a == s2a` implies identical lengths
        Libc.memcmp(Ptr{UInt8}(s1b), Ptr{UInt8}(s2b), len) == 0
    end
end
Implementing comparison
Base.@assume_effects :total function Base.hash(s::String2, h::UInt)
    h += hash(String2)
    u = reinterpret(UInt128, s)
    ub = u >> 64 % UInt64
    if isinline(s)
        ua = u % UInt64
        hash(hash(ua, h), ub) # This is faster than `hash(u, h)` for some reason
    else
        len = ua >> 32
        h += Base.memhash_seed
        ccall(Base.memhash, UInt, (Ptr{UInt8}, Csize_t, UInt32), Ptr{UInt8}(ub), len, h % UInt32) + h
    end
end
Implementing cmp
Base.@assume_effects :total function Base.cmp(s1::String2, s2::String2)
    u1, u2 = reinterpret(UInt128, s1), reinterpret(UInt128, s2)
    pre1, pre2 = u1 % UInt32, u2 % UInt32
    cmp0 = cmp(pre1, pre2)
    if cmp0 != 0
        cmp0
    elseif isinline(s1) && isinline(s2)
        cmp(u1, u2)
    elseif isinline(s1)
        ptr2 = u2 >> 64 % UInt64
        s2pre = unsafe_load(Ptr{UInt128}(ptr2), 1) & typemax(UInt128) >> 8
        cmp(u1, s2pre)
    elseif isinline(s2)
        ptr1 = u1 >> 64 % UInt64
        s1pre = unsafe_load(Ptr{UInt128}(ptr1), 1) & typemax(UInt128) >> 8
        cmp(s1pre, u2)
    else
        len1, len2 = (u1 % UInt64) >> 32, (u2 % UInt64) >> 32
        cmp1 = Base.memcmp(Ptr{UInt8}(u1 >> 64 % UInt64), Ptr{UInt8}(u2 >> 64 % UInt64), min(len1, len2) % Csize_t)
        if cmp1 != 0
            sign(cmp1)
        else
            cmp(len1, len2)
        end
    end
end
Implementing startswith
Base.@assume_effects :total function Base.startswith(s::String2, prefix::String2)
    us, up = reinterpret(UInt128, s), reinterpret(UInt128, prefix)
    us == up && return true
    # Unfortunately for readability, doing `u >> 8 << 8` seems
    # to be a little faster than `u | INLINE_LENGTH_MASK`.
    if isinline(us) && isinline(up) # => inline
        up >> 8 << 8 == (us & up) >> 8 << 8
    elseif isinline(up)
        usp = unsafe_load(Ptr{UInt128}(us >> 64 % UInt64), 1)
        up >> 8 << 8 == (usp & up) >> 8 << 8
    else
        slen, plen = (us % UInt64) >> 32, (up % UInt64) >> 32
        slen >= plen || return false
        sptr, pptr = us >> 64 % UInt64, up >> 64 % UInt64
        Base.memcmp(Ptr{UInt8}(sptr), Ptr{UInt8}(pptr), plen) == 0
    end
end

I’ve also managed to get efficient concatination of short-form strings working nicely (it’s essentially an or + bitshift), but long/mixed cases are a bit more annoying.

Implementing 2-arg concat (WIP/more optimisation needed)
Base.@assume_effects :total function Base.:*(s1::String2, s2::String2)
    u1, u2 = reinterpret(UInt128, s1), reinterpret(UInt128, s2)
    len1 = if isinline(s1) ~(u1 >> 120 % UInt8) else (u1 % UInt64) >> 32 end
    len2 = if isinline(s2) ~(u2 >> 120 % UInt8) else (u2 % UInt64) >> 32 end
    newlen = len1 + len2
    if newlen <= 14
        newdata = u1 | (u2 << (8 * len1)) # Concatenate the data
        newdata = newdata & ~INLINE_LENGTH_MASK # Clear length
        newdata = newdata | (~newlen % UInt128) << 120 # Apply new length
        reinterpret(String2, newdata)
    else
        concat_large(u1, len1 % UInt64, u2, len2 % UInt64, newlen % UInt64)
    end
end

function concat_large(u1::UInt128, len1::UInt64, u2::UInt128, len2::UInt64, newlen::UInt64)
    newdata = Libc.malloc(newlen + 1)
    s1a = u1 % UInt64
    s1b, s2b = u1 >> 64 % UInt64, u2 >> 64 % UInt64
    if isinline(u1)
        unsafe_store!(Ptr{UInt64}(newdata), s1a, 1)
        unsafe_store!(Ptr{UInt64}(newdata), s1b, 2)
    else
        Libc.memcpy(Ptr{UInt8}(newdata), Ptr{UInt8}(s1b), len1)
    end
    if isinline(u2)
        for b in 1:len2
            unsafe_store!(Ptr{UInt8}(newdata), (u2 >> (8 * (b - 1))) % UInt8, len1 + b)
        end
    else
        Libc.memcpy(Ptr{UInt8}(newdata + len1), Ptr{UInt8}(s2b), len2 + 1)
    end
    ss_tuple = (unsafe_load(Ptr{UInt32}(newdata), 1),
                newlen % UInt32, newdata)
    ss = reinterpret(String2, ss_tuple)
    # finalizer(deallocate, ss)
    ss
end

For a serious/complete reimplementation of String, some fiddling around on the C side of Julia (in particular with GC) is needed. For now, I’ve left this alone and focused on seeing if the fundamental design seemed promising just writing Julia code.

Simple benchmarking

Current implementation, all in one place
# An experimental short string type for Julia

# This is a proof of concept for a short string type in Julia. The idea is to
# have a string type that can store short strings inline, without heap
# allocation. This is useful for strings that are often short, such as keys in
# dictionaries or identifiers.
#
# This is a hybrid of C++'s small string optimisation, and a German string.
# On little endian systems, we can use a tagged pointer for the length of
# the string, and store the data inline. This allows us to store strings up to
# 14 bytes inline, without heap allocation. For longer strings, we fall back to
# a heap allocated string and make use of the extra space after the pointer
# to store the first 8 bytes of the string.
#
# To handle the two representations, we define an opaque (primitive) type
# that we will reinterpret to the appropriate form as needed.
#
# This is just a proof of concept, and needs adjustment to also work on
# big endian systems. GC is also broken for this type, as it doesn't know
# to deallocate the heap allocated strings.

primitive type String2 <: AbstractString 128 end

#=
struct InlineStringForm
    data::NTuple{14, UInt8}
    _null::UInt8
    nlen::UInt8
end

struct HeapStringForm
    prefix::NTuple{4, UInt8}
    len::UInt32
    ptr::Ptr{UInt8}
end
=#

const INLINE_LENGTH_MASK = UInt128(0xff) << 120

# Really, the only thing that might happen here is memory
# allocation failure. That's basically `:total`, right?
Base.@assume_effects :total function String2(s::String)
    if ncodeunits(s) <= 14
        data = Memory{UInt8}(undef, 15)
        fill!(data, 0x00)
        unsafe_copyto!(pointer(data), pointer(s), ncodeunits(s))
        reinterpret(String2, (NTuple{15, UInt8}(data)..., ~(length(s) % UInt8)))
    else
        prefix = Memory{UInt8}(undef, 4)
        unsafe_copyto!(pointer(prefix), pointer(s), 4)
        dataref = Libc.malloc(ncodeunits(s) + 1)
        Libc.memcpy(dataref, pointer(s), ncodeunits(s) + 1)
        ss_tuple = (NTuple{4, UInt8}(prefix)..., UInt32(ncodeunits(s)), UInt64(dataref))
        ss = reinterpret(String2, ss_tuple)
        # finalizer(deallocate, ss) # Can't attach finalizers to immutable types
        ss
    end
end

function deallocate(s::String2) # Can't call automatically 🥲
   isinline(s) && return
    _, ptr = reinterpret(Tuple{UInt64, UInt64}, s)
    Libc.free(ptr)
end

isinline(s::UInt128) = !iszero(s & INLINE_LENGTH_MASK)
isinline(s::String2) = isinline(reinterpret(UInt128, s))

Base.@assume_effects :total function Base.String(s::String2)
    if isinline(s)
        data..., _, nlen = reinterpret(NTuple{16, UInt8}, s)
        String(collect(data[1:~nlen]))
    else
        _, ptr = reinterpret(Tuple{UInt64, UInt64}, s)
        unsafe_string(Ptr{UInt8}(ptr))
    end
end

Base.@assume_effects :total @inline function Base.ncodeunits(s::String2)
    u = reinterpret(UInt128, s)
    if isinline(s)
        ~(u & INLINE_LENGTH_MASK) >> 120 % Int
    else
        (u1 % Int64) >> 32 % Int
    end
end

Base.@assume_effects :total function Base.codeunit(s::String2, i::Int)
    if isinline(s)
        data..., _, nlen = reinterpret(NTuple{16, UInt8}, s)
        @boundscheck i <= ~nlen || throw(BoundsError(s, i))
        data[i]
    else
        _, len, ptr = reinterpret(Tuple{UInt32, UInt32, UInt64}, s)
        @boundscheck i <= len || throw(BoundsError(s, i))
        unsafe_load(Ptr{UInt8}(ptr), i)
    end
end

Base.codeunit(::String2) = UInt8

Base.iterate(s::String2, i::Int=1) = iterate(String(s), i)

Base.isvalid(s::String2) = Base.isvalid(String, codeunits(s))
Base.isvalid(s::String2, i::Int) = checkbounds(Bool, s, i) && thisind(s, i) == i
Base.thisind(s::String2, i::Int) = thisind(String(s), i)

# Here I implement faster versions of basic string operations
# to compare with the built-in string type.

Base.@assume_effects :total function Base.:(==)(s1::String2, s2::String2)
    u1, u2 = reinterpret(UInt128, s1), reinterpret(UInt128, s2) # For efficiency
    s1a, s2a = u1 % UInt64, u2 % UInt64 # Prefix1-8 or Prefix1-4 + Length
    s1a == s2a || return false # ∵ Different prefix and/or length
    s1b, s2b = u1 >> 64 % UInt64, u2 >> 64 % UInt64 # Prefix9-14 + Length or Ptr
    s1b == s2b && return true # ∵ Completely identical
    if isinline(s1) || isinline(s2)
        false # ∵ We've checked the entire inline form, if either are inline, they're not equal
    else
        len = s1a >> 32 # `s1a == s2a` implies identical lengths
        Libc.memcmp(Ptr{UInt8}(s1b), Ptr{UInt8}(s2b), len) == 0
    end
end

Base.@assume_effects :total function Base.hash(s::String2, h::UInt)
    h += hash(String2)
    u = reinterpret(UInt128, s)
    ub = u >> 64 % UInt64
    if isinline(s)
        ua = u % UInt64
        hash(hash(ua, h), ub) # This is faster than `hash(u, h)` for some reason
    else
        len = ua >> 32
        h += Base.memhash_seed
        ccall(Base.memhash, UInt, (Ptr{UInt8}, Csize_t, UInt32), Ptr{UInt8}(ub), len, h % UInt32) + h
    end
end

Base.@assume_effects :total function Base.cmp(s1::String2, s2::String2)
    u1, u2 = reinterpret(UInt128, s1), reinterpret(UInt128, s2)
    pre1, pre2 = u1 % UInt32, u2 % UInt32
    cmp0 = cmp(pre1, pre2)
    if cmp0 != 0
        cmp0
    elseif isinline(s1) && isinline(s2)
        cmp(u1, u2)
    elseif isinline(s1)
        ptr2 = u2 >> 64 % UInt64
        s2pre = unsafe_load(Ptr{UInt128}(ptr2), 1) & typemax(UInt128) >> 8
        cmp(u1, s2pre)
    elseif isinline(s2)
        ptr1 = u1 >> 64 % UInt64
        s1pre = unsafe_load(Ptr{UInt128}(ptr1), 1) & typemax(UInt128) >> 8
        cmp(s1pre, u2)
    else
        len1, len2 = (u1 % UInt64) >> 32, (u2 % UInt64) >> 32
        cmp1 = Base.memcmp(Ptr{UInt8}(u1 >> 64 % UInt64), Ptr{UInt8}(u2 >> 64 % UInt64), min(len1, len2) % Csize_t)
        if cmp1 != 0
            sign(cmp1)
        else
            cmp(len1, len2)
        end
    end
end

Base.@assume_effects :total function Base.startswith(s::String2, prefix::String2)
    us, up = reinterpret(UInt128, s), reinterpret(UInt128, prefix)
    us == up && return true
    # Unfortunately for readability, doing `u >> 8 << 8` seems
    # to be a little faster than `u | INLINE_LENGTH_MASK`.
    if isinline(us) && isinline(up) # => inline
        up >> 8 << 8 == (us & up) >> 8 << 8
    elseif isinline(up)
        usp = unsafe_load(Ptr{UInt128}(us >> 64 % UInt64), 1)
        up >> 8 << 8 == (usp & up) >> 8 << 8
    else
        slen, plen = (us % UInt64) >> 32, (up % UInt64) >> 32
        slen >= plen || return false
        sptr, pptr = us >> 64 % UInt64, up >> 64 % UInt64
        Base.memcmp(Ptr{UInt8}(sptr), Ptr{UInt8}(pptr), plen) == 0
    end
end

# More work-in-progress methods

Base.@assume_effects :total function Base.:*(s1::String2, s2::String2)
    u1, u2 = reinterpret(UInt128, s1), reinterpret(UInt128, s2)
    len1 = if isinline(s1) ~(u1 >> 120 % UInt8) else (u1 % UInt64) >> 32 end
    len2 = if isinline(s2) ~(u2 >> 120 % UInt8) else (u2 % UInt64) >> 32 end
    newlen = len1 + len2
    if newlen <= 14
        newdata = u1 | (u2 << (8 * len1)) # Concatenate the data
        newdata = newdata & ~INLINE_LENGTH_MASK # Clear length
        newdata = newdata | (~newlen % UInt128) << 120 # Apply new length
        reinterpret(String2, newdata)
    else
        concat_large(u1, len1 % UInt64, u2, len2 % UInt64, newlen % UInt64)
    end
end

function concat_large(u1::UInt128, len1::UInt64, u2::UInt128, len2::UInt64, newlen::UInt64)
    newdata = Libc.malloc(newlen + 1)
    s1a = u1 % UInt64
    s1b, s2b = u1 >> 64 % UInt64, u2 >> 64 % UInt64
    if isinline(u1)
        unsafe_store!(Ptr{UInt64}(newdata), s1a, 1)
        unsafe_store!(Ptr{UInt64}(newdata), s1b, 2)
    else
        Libc.memcpy(Ptr{UInt8}(newdata), Ptr{UInt8}(s1b), len1)
    end
    if isinline(u2)
        for b in 1:len2
            unsafe_store!(Ptr{UInt8}(newdata), (u2 >> (8 * (b - 1))) % UInt8, len1 + b)
        end
    else
        Libc.memcpy(Ptr{UInt8}(newdata + len1), Ptr{UInt8}(s2b), len2 + 1)
    end
    ss_tuple = (unsafe_load(Ptr{UInt32}(newdata), 1),
                newlen % UInt32, newdata)
    ss = reinterpret(String2, ss_tuple)
    # finalizer(deallocate, ss)
    ss
end

function Base.sizeof(s::String2)
    if isinline(s)
        sizeof(typeof(s))
    else
        _, len, _ = reinterpret(Tuple{UInt32, UInt32, UInt64}, s)
        sizeof(typeof(s)) + len + 1
    end
end

Base.summarysize(s::String2) = sizeof(s)
REPL copypasta
julia> using Chairmarks

julia> s1to6 = s1, s2, s3, s4, s5, s6 = "hey", "hey", "woah", "a long string now", "another long string", "another long string"
("hey", "hey", "woah", "a long string now", "another long string", "another long string")

julia> ss1to6 = ss1, ss2, ss3, ss4, ss5, ss6 = map(String2, s1to6)
("hey", "hey", "woah", "a long string now", "another long string", "another long string")

julia> [(a, b) for (a, b) in Iterators.product(1:6, 1:6)]
6×6 Matrix{Tuple{Int64, Int64}}:
 (1, 1)  (1, 2)  (1, 3)  (1, 4)  (1, 5)  (1, 6)
 (2, 1)  (2, 2)  (2, 3)  (2, 4)  (2, 5)  (2, 6)
 (3, 1)  (3, 2)  (3, 3)  (3, 4)  (3, 5)  (3, 6)
 (4, 1)  (4, 2)  (4, 3)  (4, 4)  (4, 5)  (4, 6)
 (5, 1)  (5, 2)  (5, 3)  (5, 4)  (5, 5)  (5, 6)
 (6, 1)  (6, 2)  (6, 3)  (6, 4)  (6, 5)  (6, 6)

julia> [@b a == b for (a, b) in Iterators.product(s1to6, s1to6)] # Base String
6×6 Matrix{Chairmarks.Sample}:
 1.571 ns  4.325 ns  2.947 ns  2.948 ns  2.949 ns  2.948 ns
 4.323 ns  1.571 ns  2.947 ns  2.948 ns  2.947 ns  2.947 ns
 2.947 ns  2.947 ns  1.570 ns  2.948 ns  2.949 ns  2.947 ns
 2.948 ns  2.947 ns  2.947 ns  1.571 ns  2.948 ns  2.949 ns
 2.947 ns  2.948 ns  2.948 ns  2.948 ns  1.571 ns  4.337 ns
 2.947 ns  2.950 ns  2.946 ns  2.951 ns  4.324 ns  1.571 ns

julia> [@b a == b for (a, b) in Iterators.product(ss1to6, ss1to6)] # My String
6×6 Matrix{Chairmarks.Sample}:
 1.571 ns  1.571 ns  1.571 ns  1.571 ns  1.571 ns  1.570 ns
 1.571 ns  1.571 ns  1.570 ns  1.570 ns  1.570 ns  1.570 ns
 1.570 ns  1.570 ns  1.571 ns  1.571 ns  1.571 ns  1.570 ns
 1.571 ns  1.571 ns  1.570 ns  1.571 ns  1.570 ns  1.570 ns
 1.571 ns  1.571 ns  1.570 ns  1.571 ns  1.571 ns  2.357 ns
 1.570 ns  1.571 ns  1.570 ns  1.571 ns  2.357 ns  1.571 ns

julia> [@b hash(s) for s in s1to6]
6-element Vector{Chairmarks.Sample}:
 4.518 ns
 4.370 ns
 4.322 ns
 3.994 ns
 4.390 ns
 4.399 ns

julia> [@b hash(s) for s in ss1to6]
6-element Vector{Chairmarks.Sample}:
 2.143 ns
 2.152 ns
 2.144 ns
 4.104 ns
 4.428 ns
 4.422 ns

julia> [@b cmp(a, b) for (a, b) in Iterators.product(s1to6, s1to6)]
6×6 Matrix{Chairmarks.Sample}:
 2.750 ns  2.946 ns  2.946 ns  2.946 ns  2.947 ns  2.946 ns
 2.946 ns  2.750 ns  2.946 ns  2.946 ns  2.946 ns  2.946 ns
 2.946 ns  2.947 ns  2.750 ns  2.946 ns  2.946 ns  2.946 ns
 2.946 ns  2.946 ns  2.946 ns  2.750 ns  2.946 ns  2.946 ns
 2.946 ns  2.912 ns  2.946 ns  2.916 ns  2.750 ns  2.946 ns
 2.946 ns  2.946 ns  2.752 ns  2.947 ns  2.946 ns  2.750 ns

julia> [@b cmp(a, b) for (a, b) in Iterators.product(ss1to6, ss1to6)]
6×6 Matrix{Chairmarks.Sample}:
 2.160 ns  2.160 ns  1.571 ns  1.571 ns  1.571 ns  1.571 ns
 2.160 ns  2.160 ns  1.571 ns  1.571 ns  1.571 ns  1.571 ns
 1.571 ns  1.571 ns  2.160 ns  1.571 ns  1.571 ns  1.571 ns
 1.571 ns  1.571 ns  1.571 ns  2.946 ns  1.571 ns  1.571 ns
 1.571 ns  1.571 ns  1.571 ns  1.571 ns  2.947 ns  2.947 ns
 1.571 ns  1.571 ns  1.571 ns  1.571 ns  2.947 ns  2.947 ns

julia> [@b startswith(a, b) for (a, b) in Iterators.product(s1to6, s1to6)]
6×6 Matrix{Chairmarks.Sample}:
 3.536 ns  3.536 ns  2.356 ns  2.356 ns  2.356 ns  2.356 ns
 3.537 ns  3.536 ns  2.356 ns  2.356 ns  2.356 ns  2.356 ns
 3.536 ns  3.536 ns  3.535 ns  2.356 ns  2.356 ns  2.356 ns
 3.538 ns  3.536 ns  3.536 ns  3.536 ns  2.356 ns  2.356 ns
 3.537 ns  3.536 ns  3.537 ns  3.536 ns  3.536 ns  3.536 ns
 3.536 ns  3.537 ns  3.538 ns  3.536 ns  3.536 ns  3.536 ns

julia> [@b startswith(a, b) for (a, b) in Iterators.product(ss1to6, ss1to6)]
6×6 Matrix{Chairmarks.Sample}:
 1.571 ns  1.571 ns  1.571 ns  1.571 ns  1.573 ns  1.572 ns
 1.571 ns  1.571 ns  1.575 ns  1.572 ns  1.571 ns  1.572 ns
 1.574 ns  1.571 ns  1.571 ns  1.572 ns  1.571 ns  1.572 ns
 1.571 ns  1.571 ns  1.571 ns  1.571 ns  1.767 ns  1.767 ns
 1.571 ns  1.571 ns  1.571 ns  2.750 ns  1.571 ns  2.750 ns
 1.571 ns  1.571 ns  1.571 ns  2.750 ns  2.750 ns  1.571 ns

julia> dS = Dict(s1 => 1)
Dict{String, Int64} with 1 entry:
  "hey" => 1

julia> dsS = Dict(ss1 => 1)
Dict{String2, Int64} with 1 entry:
  "hey" => 1

julia> @b $dS[$s1]
6.314 ns

julia> @b $dsS[$ss1]
5.216 ns

Concluding comments

This isn’t a complete drop-in-replacement for String yet, but I think this demonstrates the potential of an alternate string layout in Julia, and hopefully can provoke some collaboration/help to improve this prototype/proposal, do more extensive benchmarking, and if it still looks good start work on an actual PR implementing this (eventually).

29 Likes

This is incredibly well-timed, I was about to try taking a stab at German strings in Julia myself.

Have you benchmarked your hybrid vs more or less standard German strings?

Somewhat unrelatedly, since this needs GC work I wonder if there’s a missing GC extension point. In an ideal world we’d be able to implement these optimally without modifying C or using primitive types to be honest.

2 Likes

Limiting the size of a string to only 4 GB will prevent this from being used as a drop-in replacement for String in many cases. Could the Length field steal a few bytes from Prefix or Pointer?

This would make the analogue of unsafe_wrap(String, ptr) impossible — this is used to wrap a string pointer (e.g. returned by a ccall) in a Julia String — because arbitrary string pointers need not be 8-byte aligned (e.g. they could be a “view” into a larger array, such as the pointer returned by strchr).

Couldn’t you rearrange the long form into pointer | length | prefix and ensure that the 2nd to last byte of the prefix is nonzero, to distinguish it from the NUL terminator of the short form? In particular, for long strings in which the 3rd byte is NUL (which is very uncommon), simply store a special prefix like 0x0000ff00 that indicates that the prefix contains no useful data.

2 Likes

Not at all, it would be cool to see how the current attempt compares though.

Would it really? I struggle to imagine practical occurrences of >4 GiB strings, particularly where a buffer of sorts isn’t much more sensible.

Perhaps. The challenge is how we shuffle things around. I’m very much open to ideas! The semi-competing interests I’ve got in mind are:

  • As much data as possible in the short form
  • Opportunity for optimisation via the German string style prefix
  • Simplicity
  • A long form that can handle everything we need it to
  • Efficiency
3 Likes

Hmmm, I didn’t realise we’d need to accommodate arbitrary non-malloc-aligned addresses. That does complicate things a bit.

Potentially? I see two complications that give me pause with this:

  • The complexity from a data discontinuity that needs to be navigated around
  • Breaking the consistency between the short/long form, having overlap in the format and the efficient quick checks it allows for is rather nice I think.

Being able to use the low-bits of a pointer for a tag is rather nice, but if that doesn’t work it’s worth mentioning that some of the high bits are also available (the first 16 IIRC), though it might be harder to have nice memory alignment using them (separately: maybe these could also be used for >4GiB length encodings without getting rid of the prefix).

Could we make an assumption that valid pointers are merely byte-aligned, or just lie on an even bit at least, and complicate the short-length encoding? I think that’s minimal assumption we need to make in order for low-bit overloading to be valid.

1 Like

How does that simplify things? Pointers are in bytes.

One option, assuming that the long-term goal is to replace the current String, is to assume that you will always have at least two kinds of strings:

  1. a String2 for Julia-allocated strings that are < 4GiB in size — almost all strings. (Though it may be a tricky problem in any given 64-bit application to guarantee that the 4GiB limit is never reached, or that it is okay to just throw an exception if that happens.)
  2. a StringView (ala StringViews.jl) for other UTF-8 data — something that can wrap arbitrary UInt8 arrays (which can be wrapped around arbitrary pointers).

It will be years before we can contemplate replacing String with anything that has such different restrictions, however — not until a hypothetical Julia 2.0 — so for the time being any String2 type will live in a package. In this case, any program that an live with the limitations of (1) can hopefully just drop it in, and otherwise they can use String.

3 Likes

I don’t see any problem with String2 being in a package. I do not think a restriction that can suddenly make your code fail or added assumptions more than people expecting their “strings” to be would be the right call for a standard library. They belong in a (package) library. Julia is well-designed in this regard that it accepts there can be multiple implementations of strings, hence the abstract string type.

1 Like

The implementation as is requires changes to the GC which cannot be made in packages

1 Like

@ScottPJones has done some alternative string implementations in Strs.jl that might be worth comparing against.

The JuliaCon talk features some benchmarks — are those public to be used for checking out new string implementations?

1 Like

Thank you for looking into this (it’s been on my TODO to implement my “German string” idea…).

I’m thinking should this be implemented in pure Julia at al (yes, it can be, I’m just saying it needs not be pure, Base could wrap C and/or Rust code, at least eventually)? Should we reuse some of the Rust (or C or C++ implementations)? I haven’t decided which, but if we do would it be easier to hand Strings over to Rust and vice versa (The Rust packages below are the so-called "impossible Rust implementations, so I’m not sure, would help those users, but not most Rust users that use Rust’s default string type, or rather it would copy… at least a pointer)? [Since it’s an alternative string idea anyway, maybe we should validate strange as UTF-8, which is currently optional, since slower, but superfast C++/assembly code is available for it, so we want to rely on other languages anyway.]

In my own idea the short form for strings uses all 16 bytes for data, up to 16 ASCII letters, actually more… (of those or UTF-8), because I want to compress the prefix.

I don’t see this as helpful. You want to fully use your 16 bytes, 128 bits, two registers, and since the struct is always a fixed (16 bytes) it most likely will be passed in registers, not on the stack (though might sometimes, depending on what the optimizer/register allocator does), so you can’t get a pointer to the stack-frame for a C-string, in general. Then why zero terminate at all (I would also not always do it for the long for if you want to allow for mutability, at least for concatenation); only when asking for a Cstring; trivia: Julia even has Cwstring…)?

And why actually store the length? It can be computed (for the short form). How often do you ask for the length anyway? Is it speed-critical for anything? You should iterate through the string usually.

The prefix doesn’t have to be in the same place for the long form, and it’s actually worse:

If the pointer is the first 8 bytes, aligning with your first 8 bytes of your prefix (assuming little endian, and we will never support big endian), then the first byte (or some of the first bytes, will be zero), a useless string to store in short form. [I think we should just ignore this optimization on 32-bit, there it wouldn’t work… as well.]

[A 4-byte prefix might get most of the bang for the buck, so I’m unclear if the following optimization is needed: would use 1 byte (or maybe 9-10 bits) for the length in long form, for then 7 bytes of prefix, but if last of its bit is set then the length is longer, 3, 4, or even 5 bytes, your choice, to allow for longer length. The length can even be stored on the heap if it’s better to have a long prefix also for the long form.

I think it’s too extreme to store strings (the short form) in 8 bytes, only one register, but you can actually do that, and it would be more practical with my prefix-compression idea (I was aiming for 2x, or even 3x in the common case).

Why is that exactly? This new string type can be done in a non-breaking way (assuming people do not assume Strings internals, that’s not part of semver…).

If you’re worried about limitation of 2/4 GB stings max, then arbitrary -length can be supported (I don’t see a huge reason to, but yes with such a max, this would be a technical breaking change). Are you required to be able to get a pointer to a String, its underlying data, for all strings, without going through Cstring? For smaller strings, that only exist in registers, it would only be possible with Cstring wrapper, and then it would need to allocate… It might be unexpected to some, but not ruled out, or I think considered a breaking change?!

I agree such an experiment can and should start in a package, it needs not be registered, for a proof-of-concept, but I want this in Julia 1.x sooner rather than later. If it starts in a package, then also ok, it could in effect be made a no-op/alias for this (new) built in string type. But I really think we ideally want only one major string type. At least it’s not good to require users to use or expect AbstractString (most might not know of it).

I didn’t take a good look at all the Rust implementations, but these might be best:
https://crates.io/crates/compact_str

https://crates.io/crates/strumbra

How do you do that exactly? Not with you example or:

julia> unsafe_wrap(String, Ptr{UInt8}(0), 1)
ERROR: MethodError: no method matching unsafe_wrap(::Type{String}, ::Ptr{UInt8}, ::Int64)

The docs or the error message were not helpful enough. I see:

Many methods in Julia such as unsafe_load and String make copies of data instead of taking ownership of the buffer, so that it is safe to free (or alter) the original data without affecting Julia. A notable exception is [unsafe_wrap](C Interface · The Julia Language, Tuple{T}, Tuple{Union{Type{Array}, Type{Array{T}}, Type{Array{T, N}}}, Ptr{T}, Tuple{Vararg{Int64, N}}}} where {T, N}) which, for performance reasons, shares (or can be told to take ownership of) the underlying buffer.

You would use unsafe_wrap(String … well for legal (UTF-8) Strings, usually, but I think you must allow for arbitrary illegal “strings” (i.e. illegal UTF-8). I’m asking since you I believe you must allow for that (while most uncommon use of this?); and give the length somehow, and then store it’s in the String object, and for this new String-type, and you can’t store it before or after that pointed to data? Or even overwrite the first few bytes with the length…?

For unsafe_string you can (and should?!) free the pointer, since you copy, but what about for unsafe_wrap?

help?> unsafe_wrap

Unlike unsafe_load and unsafe_store!, the programmer is responsible also for ensuring that the underlying data is not accessed through two arrays of different element type, similar to the strict aliasing rule in C.

I’m not sure what that means, can no one else use the wrapped string?

Ah sorry, in the early hours of the morning I wrote the wrong thing. I meant word/even-byte, not byte/even-bit.

A dedicated “UInt8 wrapper” form is an interesting idea, and one I haven’t considered. I wonder how that would work out.

The whole point of this is to make String faster and (in the case of short strings) smaller by default. Since we don’t provide access to String implementation details, this should be non-breaking. The only breaking part proposed sofar is the 4 GiB limit, and we might be able to find a creative way around that.

Good shout! At a quick test Strs is:

  • ==: faster than String, but not as fast as my String2
  • hash: not faster
  • cmp: slower than String
  • startswith: faster than String, but not as fast as my String2

I’ll get to Palli’s comments in a bit :slightly_smiling_face:

3 Likes

I wonder whether or not this design is a good one. Perhaps having staticstrings would be better for short strings? The Julia string was certainly not designed for short strings and needing short strings to be fast is likely a niche use.

I think variants of BigInt are the most important target for custom hybrid types (e.g. up to 63 bits inline, then pointer to malloc’ed / julia-allocated). The same machinery could then be used for different string types.

With custom hybrid types I mean layouts where it is not trivial which bytes are a managed reference, and how they need to be pre-processed.

Just brainstorming the kind of runtime/codegen support: We’d need to design an API / integration points for

  1. Custom GC mark plugins. E.g. types could get an extra function pointer that is called by the GC mark phase if non-null. This would be close to free for normal types. Users could write their custom mark plugin for their type in C, and use some ccall some API to register this.
  2. Custom codegen plugin for rooting / shadow stack. There needs to be some custom code that extracts GC controlled references. We cannot have a non-inlined function call for that – too slow. On the other hand, directly interfacing with codegen is probably too complex for normal users. So instead, users should probably ship llvm bitcode for their function, register that, and this can then be inlined. Realistically people want to write that in a high-level language like C, not llvm-assembly. So Julialang would probably need to provide some help for that – i.e. some headers and example makefiles containing the right compiler+linker invocations to emit “portable” or multi-versioned bitcode (the target hardware arch is not known at C-to-bitcode compile time of the plugin, only at bitcode-to-machine-code compilation).

If you’re just using strings as things that are printed, path names, etc. then yea the performance of the builtin/basic string type doesn’t really matter, but I’d caution against dismissing string performance as important for Julia — there are plenty of other cases where you can end up with a lot of small strings and the performance of basic comparison functions has a large impact on the overall program.

If we can make a basic builtin String type that’s strictly better than the current one (and I think we’ve got enough to suggest it’s worth trying, if noting else), I don’t think we should leave that performance on the table.

I’m not up on this stuff, but @jeff.bezanson 's trial PR a while ago lines up with this (which tries a different approach to the layout I’m brainstorming here):

2 Likes

I’m currently operating on the expectation that bits will have to be implemented in C (such as the GC tracking), but the more we can have as pure Julia the nicer, so I’m just seeing how far I can get by writing Julia.

This intrigues me, what do you mean by “compress the prefix”? I can’t see a clarification/elaboration in your later writing.

Good point. The inline form may as well skip the null byte and not worry so much about this.

If we don’t store the length, how do we know how many bytes are in the string? "\0\0\0" is a valid 3-char String in Julia. As I see it, we can go up to 15 chars in 16 bytes when the null byte is legal.

I can’t say I quite follow here, on two fronts:

  • The point of having an inline prefix in the long form is to be able to frequently shortcut some comparison operations, without having to do a memory lookup.
  • In the current proposal, the prefix of the short form and the pointer of the long form are not in alignment.

Yea, I’m making these assumptions for now.

It occurs to me that it the high 16 bits of the pointer might allow us to do strings up to ~64 TiB, but we’d then need to mask the pointer before using it as a pointer (I believe information in the high bits is UB on most systems).

Somethings that’s just occurred to me: mightn’t uses like this cause aliasing/ownership issues?

Yes, that’s why unsafe_wrap has the own keyword. I’m not sure whether this is documented for String though.

unsafe_wrap in the docs

EDIT: for some reason, discourse broke the link and I didn’t notice for a few hours…