Community string benchmark suite

Hello!

I’d like your help formulating a high-quality suite of string benchmarks. Over in Internals & Design I’ve started a conversation about potentially renovating String’s memory representation to reduce memory usage with small strings, and improve comparison performance across the board :grinning:

At this point, a bunch of potential designs have been floated, but one of the things that makes them hard to evaluate is that a lot of the choices come down to tradeoffs, and (just speaking for myself) I’m not confident enough to say how certain tradeoffs will work out in practice without trying them out.

To this end, I’ve put together a large corpus of test strings by scraping the source code of all registered Julia packages, and extracting all literal strings.

It would be great to have some help constructing a highly informative set of benchmarks from this. Beyond benchmarking basic operations like ==, isless, and hash, I’m interested in how this could affect higher-level functionality like leftjoin! on a DataFrame with String columns, or CSV.read using String instead of InlineStrings.

These are two examples of operations where String performance matters, but I’m sure the community as a whole has a much broader view of String-relevant operations than I do.

Call to action

So, if I could get your help thinking of String-oriented behavior that would be good to benchmark, that would be tremendous. Better benchmarks will let us better evaluate design tradeoffs, and might even get us to a better String type down the line :wink: (no promises though).

Complete benchmark snippets (with third-party packages allowed) would be ideal! :star_struck:

p.s. I’m also interested in supplementary string corpuses, if you have any suggestions.

4 Likes

I often work with large tables (say in the tens of GB) with many string columns, and used to frequently run into situtations where the GC was stressed to a level that made the whole session completely unresponsive.

I posted an example of how the base String compares to ShortStrings (which I was using at the time, this was before InlineStrings were a thing) here, have a look to see if that’s useful:

1 Like

That means the GC wasn’t aggressive enough, but currently is? With a different string type (e.g. reference counted) strings can be freed eagerly, you most often know when strings go out of scope, and/or a clever compiler could de-allocate the (current default) strings allocated in a in a loop already.

But none of those changes are needed: if you allocate and add up memory, and most of it is garbage, then it shouldn’t affect responsivity (ideally), the GC should just be more aggressive. I believe it is (and now it’s multi-threaded, and that part can be tuned, though likely and hopefully the defaults good enough).

The short string optimization proposal under discussion, at internals (basically a string type, that would end up in Base, behaving as InlineStrings speed-wise, but still allowing longer strings, then in the heap transparently), would take away the GC load completely when the strings are short, like InlineStrings do already but in addition be also very good (slightly better) for sorting.

Here is a benchmark that is similar to the string processing I am doing in ZipArchives.jl. I have a large vector of bytes and a vector of ranges in that buffer that contain entry names. I then want to search for a specific entry name. The strings represent paths so most of them will have a common prefix.

using BenchmarkTools
using Random

Random.seed!(1234)

# Create a data buffer and ranges containing N random paths.
# The paths all start with "root-path/" and then have some additional random bytes
function create_buffer_ranges(N)
    buffer = UInt8[]
    ranges = UnitRange{Int64}[]
    for i in 1:N
        rand_path = [b"root-path/"; rand(UInt8, rand(5:20))]
        start = length(buffer) + 1
        append!(buffer, rand_path)
        stop = length(buffer)
        push!(ranges, start:stop)
        append!(buffer, rand(UInt8, rand(30:50))) # add on other random data
    end
    buffer, ranges
end

# Find the last path in the buffer with the same codeunits as needle
function findlast_range(needle, buffer::Vector{UInt8}, ranges::Vector{UnitRange{Int64}})
    findlast(ranges) do r
        length(r) == ncodeunits(needle) && view(buffer, r) == codeunits(needle)
    end
end

buffer, ranges = create_buffer_ranges(2000)

first_path = String(buffer[ranges[1]])

@btime findlast_range($first_path, $buffer, $ranges)