ANN: StringUnits.jl

Inspired by this lengthy thread, which I jumped into somewhat later, I’ve implemented the idea as StringUnits.jl.

Pretty straightforward:

julia> "aβc∅👨🏻‍🌾efg"[2ch:4ch]
"βc∅"

julia> "aβc∅👨🏻‍🌾efg"[5gr]
"👨🏻‍🌾"

julia> "aβc∅👨🏻‍🌾efg"[3gr:6gr]
"c∅👨🏻‍🌾e"

Units can be mixed:

julia> "aβc∅👨🏻‍🌾efg"[2ch+1gr:3ch+2gr]
"c∅👨🏻‍🌾e"

Currently, negative indices of any sort are not supported, although I’d like to add that, time permitting.

Oh, and you can use normal byte-based indices as well:

julia> "aβc∅👨🏻‍🌾efg"[5:5+3gr]
"∅👨🏻‍🌾ef"

More details may be found in the documentation.
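
For anyone curious what the three units are counting, here is a minimal sketch that uses only Base and the Unicode standard library, not StringUnits itself. It spells the sample string with escapes (an assumption on my part about the exact code points in the emoji: man, skin-tone modifier, zero-width joiner, sheaf of rice) and reaches two of the substrings above by hand. All variable names are illustrative.

```julia
using Unicode  # stdlib; provides `graphemes`

# The sample string, written with escapes so the code points are explicit.
s = "a\u03b2c\u2205\U0001F468\U0001F3FB\u200D\U0001F33Eefg"

nbytes     = ncodeunits(s)          # UTF-8 code units, what plain Int indices count
nchars     = length(s)              # Chars (Unicode scalar values)
ngraphemes = length(graphemes(s))   # extended grapheme clusters

# Characters 2 through 4 without the package:
# turn character positions into byte indices with `nextind`.
chars_2_to_4 = s[nextind(s, 0, 2):nextind(s, 0, 4)]

# The fifth grapheme without the package:
# materialize the clusters and pick one.
fifth_grapheme = collect(graphemes(s))[5]

println((nbytes, nchars, ngraphemes))
println(chars_2_to_4)
println(fifth_grapheme)
```

The hand-written versions work, but each unit needs its own idiom, which is roughly the gap the `ch` and `gr` suffixes are meant to close.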

FWIW, have you considered other names?

  • StringIndexingUnits
  • StringIndexers

The discussion thread I linked to had several names, including, of course, StringIndex.

I went with StringUnits because they’re units. Saying “five graphemes” is much like saying “five meters”, so StringUnits spells that 5gr, just like Unitful.jl spells the latter 5m.

StringIndex was (and is) a proposal to change how Julia itself does indexing, and it differs in certain particulars from the approach taken in this package. That said, contributors to that thread fleshed out basically all of the ideas that went into the implementation at one point or another.

StringIndexingUnits seems redundant, and no, I never considered that, or StringIndexers.

I did consider naming the package StringIndices, but I took a look at how Unitful does units and decided on the name I ultimately chose. We have a string index, the code-unit. This adds additional units.

Nice package!
Regarding the name - while I see your point on the similarity to regular units, I think where the comparison to meters fails is that there is no fixed conversion factor between chars, graphemes, codepoints.

just my 2c worth of bike shedding

The package went through the standard 3-day registration process, which would have been the time to comment on package naming. At this point, while it may not be the perfect or most inspired name, it’s what we have. I’d recommend following the #new-packages-feed on Slack or Zulip for a chance to weigh in on new registrations.

So let’s not focus on the package name here and just congratulate @mnemnion for a package with some very nice functionality!

Sorry, I do not frequent these places.

That’s fine, but once a package is registered, the name is pretty much a done deal. Registered packages cannot really be renamed, although the author could decide to re-release the package under a new, different name. That kind of renaming is something that should be pretty rare, though, IMO.

Looks cool despite some quirks!
I wonder if there are packages implementing this approach to strings in other languages?

Thank you. I’d like to point people to the original thread: StringIndex is @StefanKarpinski’s idea, and StringUnits is different enough that I didn’t want to arrogate the name.

Thanks!

This is true! Let me point you to a quote in the documentation:

A few of you are squirming in your chairs at this point. Yes, ‘addition’ of heterogeneous StringUnits doesn’t commute. Yes, this is abuse of notation. Yes, I’m interested in your breakdown of the real analysis of StringUnit metrics, including a notation. No, I won’t change StringUnits to use it. Yes, I would more-than-likely link to your contribution.

If I named the package StringMetrics, that would confuse almost everyone. Metrics can be associated with a unit, and while in physics care is taken to keep standard units convertible, in mathematics more broadly, it doesn’t work that way.

Consider a line drawn across a map. Two metrics might be “counties crossed” and “streets crossed”, which could be given units, so you could say “move two counties left” or “give me the third through the seventh street in the second county”. These would not be interconvertible either, but they would be units.
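
To make the "not interconvertible" point concrete for strings, here is a small sketch using only Base and the Unicode stdlib. `unit_counts` is a hypothetical helper written for this example, not part of StringUnits. Each string below has three graphemes, yet the character and byte counts differ, so no single factor converts one unit into another.

```julia
using Unicode  # stdlib; provides `graphemes`

# Hypothetical helper: measure one string in three different units.
unit_counts(s) = (gr = length(graphemes(s)), ch = length(s), bytes = ncodeunits(s))

plain    = unit_counts("abc")                      # ASCII letters
greek    = unit_counts("\u03b1\u03b2\u03b3")       # alpha, beta, gamma
accented = unit_counts("e\u0301e\u0301e\u0301")    # 'e' + combining acute, three times

for (label, counts) in (("plain", plain), ("greek", greek), ("accented", accented))
    println(label, ": ", counts)
end
```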

I don’t mind a bit of bikeshedding about the name! But I suspect we’ve hit the limit on what might be usefully said.

I’m not aware of any. Julia’s multiplication-by-juxtaposition is pretty unusual, as is the powerful genericity and dispatch which let me add methods to getindex when the first argument is already well-defined for Integer and UnitRange. So being able to exactly say things like str[1gr:5gr], no, I haven’t seen that.
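
As a rough illustration of the mechanism described here (juxtaposition plus dispatch on the index type), the following is a minimal sketch and not how StringUnits is actually implemented. `CharIdx`, `CharSpan`, and `chr` are hypothetical names invented for this example, and it handles only character positions.

```julia
# Hypothetical names throughout: CharIdx, CharSpan and chr are illustrative only.
struct CharIdx
    n::Int            # "the n-th character"
end

struct CharSpan
    first::Int        # first character position, inclusive
    last::Int         # last character position, inclusive
end

const chr = CharIdx(1)   # the unit itself

# `2chr` parses as `2 * chr`, so a single `*` method provides the literal syntax.
Base.:*(k::Integer, u::CharIdx) = CharIdx(k * u.n)

# `2chr:3chr` calls the colon function on two CharIdx values.
Base.:(:)(a::CharIdx, b::CharIdx) = CharSpan(a.n, b.n)

# New getindex methods. Dispatch on the index type leaves the existing
# Integer and UnitRange methods untouched.
Base.getindex(s::AbstractString, i::CharIdx) = s[nextind(s, 0, i.n)]
Base.getindex(s::AbstractString, r::CharSpan) =
    s[nextind(s, 0, r.first):nextind(s, 0, r.last)]

s = "aβc∅"
second_char = s[2chr]        # a Char
middle      = s[2chr:3chr]   # a String
```

Because the new index types belong to the sketch, adding `getindex` methods for them does not change what `s[2]` or `s[2:4]` mean.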

Swift’s Character type is an extended grapheme cluster, which, presuming both Julia and Swift implement the Unicode standard correctly, should be identical to graphemes in Julia. The language makes it cheap and easy to work with these, at the cost of a somewhat opaque representation of some things. Swift has a String.Index, which is an index to a specific character in a specific string; this is more like the StringIndex in the original thread, and I believe it was the inspiration behind it. It’s all rather complicated, and Strings are also mutable.

Swift is the only language I know which chose to make graphemes the basic unit. It’s possible to get your hands on “Unicode scalar values”, what Julia calls a Char, but as far as I know, literal byte offsets (to codeunits) aren’t even available.

The quirks in StringUnits are really consequences of how Julia does intervals and indexing. So you need to keep in mind things like “second character after the fifth grapheme’s index” (5gr + 2ch) could be inside that grapheme, if it’s a long one like Farmer Bob here: 👨🏻‍🌾
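
The same effect can be reproduced with Base and the Unicode stdlib alone. This sketch does not call StringUnits; it assumes the reading given above, where the offset is counted in characters from the byte index at which the fifth grapheme starts. `grapheme_start` is a hypothetical helper, and the emoji is assumed to be man, skin-tone modifier, zero-width joiner, sheaf of rice.

```julia
using Unicode  # stdlib; provides `graphemes`

s = "a\u03b2c\u2205\U0001F468\U0001F3FB\u200D\U0001F33Eefg"

# Hypothetical helper: byte index at which the n-th grapheme starts,
# found by summing the byte lengths of the graphemes before it.
function grapheme_start(s::AbstractString, n::Integer)
    idx = 1
    for (k, g) in enumerate(graphemes(s))
        k == n && return idx
        idx += ncodeunits(g)
    end
    throw(BoundsError(s, n))
end

start5 = grapheme_start(s, 5)           # where the farmer begins
start6 = grapheme_start(s, 6)           # where the next grapheme begins
two_chars_on = nextind(s, start5, 2)    # two characters past start5

# The result is a valid character index, but it is still inside the farmer.
println(start5 < two_chars_on < start6)
```

Here the second character after the start of the cluster is the zero-width joiner, so slicing from that index would cut the emoji in half.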

I expect the only mixed units which will be particularly useful are a mixture of a native byte offset with one of the categories, things like 50 + 7gr: the seven graphemes starting from index 50, which might come from findfirst or whatever.
