Hello!
I’d like your help formulating a high-quality suite of string benchmarks. Over in Internals & Design I’ve started a conversation about potentially renovating String’s memory representation to reduce memory usage for small strings and improve comparison performance across the board.
At this point, a bunch of potential designs have been floated, but one of the things that makes them hard to evaluate is that a lot of the choices come down to tradeoffs, and (just speaking for myself) I’m not confident enough to say how certain tradeoffs will work out in practice without trying them out.
To this end, I’ve put together a large corpus of test strings by scraping the source code of all registered Julia packages, and extracting all literal strings.
It would be great to have some help constructing a highly informative set of benchmarks from this. Beyond benchmarking basic operations like ==, isless, and hash, I’m interested in how a new representation could affect higher-level functionality like leftjoin! on a DataFrame with String columns, or CSV.read using String instead of InlineStrings.
These are two examples of operations where String performance matters, but I’m sure the community as a whole has a much broader view of String-relevant operations than I do.
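To make the basic operations concrete, here is a minimal sketch of what a micro-benchmark for ==, isless, and hash might look like using only Base. The corpus here is a hypothetical synthetic stand-in (the real scraped corpus would be loaded instead), and the helper names count_equal_pairs, count_ordered_pairs, fold_hashes, and min_time are made up for illustration. A proper harness such as BenchmarkTools would likely give more reliable numbers than this hand-rolled timing loop.

```julia
# Hypothetical stand-in corpus; the real scraped corpus would be loaded here instead.
corpus = String[string("pkg_", i % 97, "_", "x"^(i % 23)) for i in 1:10_000]

# Count adjacent pairs that compare equal; exercises == on strings.
function count_equal_pairs(strs)
    n = 0
    for i in 1:length(strs)-1
        n += strs[i] == strs[i+1]
    end
    return n
end

# Count adjacent pairs that are in sorted order; exercises isless.
function count_ordered_pairs(strs)
    n = 0
    for i in 1:length(strs)-1
        n += isless(strs[i], strs[i+1])
    end
    return n
end

# Chain the hash of every string into one value; exercises hash.
function fold_hashes(strs)
    h = zero(UInt)
    for s in strs
        h = hash(s, h)
    end
    return h
end

# Minimum wall-clock time over several runs, after one warm-up call
# so that compilation time is not included in the measurement.
function min_time(f, arg; runs = 20)
    f(arg)
    return minimum(@elapsed(f(arg)) for _ in 1:runs)
end

for (name, f) in (("==", count_equal_pairs),
                  ("isless", count_ordered_pairs),
                  ("hash", fold_hashes))
    println(rpad(name, 8), min_time(f, corpus), " s")
end
```

Returning an accumulated result from each kernel is a deliberate choice: it keeps the compiler from eliding the string operations being measured. Comparing adjacent elements is only one access pattern; sorted, shuffled, and length-bucketed orderings of the corpus would likely stress the comparison paths differently.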
Call to action
So, if I could get your help thinking of String-oriented behavior that would be good to benchmark, that would be tremendous. Better benchmarks will let us better evaluate design tradeoffs, and might even get us to a better String type down the line (no promises though).
Complete benchmark snippets (with third-party packages allowed) would be ideal!
P.S. I’m also interested in supplementary string corpora, if you have any suggestions.