StringIndex idea (Julia 2.0)

sijo · February 9, 2024, 8:44am

Interesting but my_string[end-9th:end-5th] is disturbing, it goes against the semantics of th…

JM_Beckers · February 9, 2024, 9:00am

maybe

my_string[9thfromend:5thfromend] or something similar

sijo · February 9, 2024, 10:31am

I’d say my_string[end-9char:end-5char] would make most sense. To echo @goerz’s idea above we could have both 9char and 9grapheme…

uniment · February 9, 2024, 2:11pm

Can you explain what gives you this impression? I read end-9th as “the 9th from the end,” which describes the exact semantics that is hoped for.

I think the tricky part is to remember that end describes the positional index of the last element, rather than the ordinal index of the last element.

mnemnion · February 9, 2024, 3:51pm

I wouldn’t want to conflate ordinal syntax with StringIndex, although I think you’re on to something with using juxtaposition for StringIndex types.

Two reasons. The first is that indexing by codepoint isn’t an ordinal vs. cardinal distinction, so “if you use 4, it’s a byte index; if you use 4th, it’s a codepoint index” would just be something to remember. It’s not at all obvious that 4th means the fourth codepoint rather than the fourth codeunit.

The other one is that there are at minimum four ways to want to index a string: by its codeunits, codepoints, graphemes, and textwidth. I would argue that most times a developer is tempted to use codepoints, what they actually want is graphemes. Codepoints will prevent throwing an invalid index error, but they don’t prevent splitting up a multi-codepoint emoji or Â (I don’t have a good way to type Latin characters with combining codepoints, just pretend that one is composed).
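To make the four counts concrete, here is a small sketch using only Base and the Unicode standard library. The sample string is my own choice: it spells “Â” as 'A' plus U+0302 (combining circumflex), which is the composed case described above.

```julia
using Unicode  # standard library; provides `graphemes`

# 'A' + U+0302 (combining circumflex) + 'z'
s = "A\u0302z"

n_codeunits  = ncodeunits(s)           # bytes: 1 for 'A', 2 for U+0302, 1 for 'z'
n_codepoints = length(s)               # Chars produced by iteration
n_graphemes  = length(graphemes(s))    # user-perceived characters
n_columns    = textwidth(s)            # terminal columns

# Slicing by codepoint never throws, but it can strip the accent from its letter.
by_codepoint = first(s, 1)             # just "A", the circumflex is left behind
# Slicing by grapheme keeps base letter and combining mark together.
by_grapheme  = first(graphemes(s))     # 'A' plus the circumflex, as a SubString
```

For this string the four counts come out as 4, 3, 2 and 2, so each way of indexing gives a different answer to “where is the second element?”.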

But a syntax such as string[1ch:5ch], string[1gr:5gr], string[1wc:5wc] is a nicely compact way to generate those sorts of indices. wc is named for the wcwidth function, which never quite made it into the C standard; it could instead be 1tw, to remind users that it uses the textwidth function under the hood. I think this is fitting because the various approaches to indexing are effectively units, this looks like a unit, and units are a place where highly abbreviated names are considered acceptable: no one insists on 1m being spelled 1meter.
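A rough sketch of how the unit-like spelling could be built on numeric-literal juxtaposition, shown for the ch case only. Every name here (CharIndex, CharUnit, CharRange, charbyte, and ch itself) is hypothetical and is not taken from Base or from any existing package.

```julia
# Hypothetical sketch: none of these types exist in Base.
struct CharIndex          # "the n-th Char"
    n::Int
end
struct CharUnit end
const ch = CharUnit()

# Juxtaposition `3ch` lowers to `3 * ch`, so one method of `*` is enough.
Base.:*(n::Integer, ::CharUnit) = CharIndex(n)

# `a:b` with two CharIndex endpoints builds a small range-like wrapper.
struct CharRange
    first::Int
    last::Int
end
(::Colon)(a::CharIndex, b::CharIndex) = CharRange(a.n, b.n)

# Convert a Char count into a byte index, then index the usual way.
charbyte(s::AbstractString, n::Integer) = nextind(s, 0, n)
Base.getindex(s::AbstractString, i::CharIndex) = s[charbyte(s, i.n)]
Base.getindex(s::AbstractString, r::CharRange) =
    s[charbyte(s, r.first):charbyte(s, r.last)]

s = "αβγδεζ"            # every letter is two bytes in UTF-8
second_char = s[2ch]    # 'β'
middle      = s[2ch:4ch]  # "βγδ"
```

A gr or tw unit would follow the same pattern with a different conversion function; this sketch does not attempt them.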

If this system also had CharString, GraphemeString, etc., the set should include 1cu (maybe just 1u?) so that those can be addressed by codeunit, since each would have its own default interpretation of an Int. In fact, maybe that is the place to unify with ordinal indexing, since it carries the intended meaning of “index this in the normal way, whether or not that differs from whatever weird expectation the indexable type has for a number”.

uniment · February 9, 2024, 4:56pm

I don’t know what a codepoint is; I’m just a dumb user. All I know is that when I iterate over a String I get a sequence of Chars, and I want my_string[4th] === (my_string...,)[4] and my_string[6th:10th] === join((my_string...,)[6:10]) and my_string[end] === my_string[length(my_string)th].
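For reference, the iteration-based results asked for here can already be obtained with existing Base functions, just less compactly. The sample string is an assumption chosen so that byte positions and Char positions diverge.

```julia
# "naïve café", with the accented letters written as escapes.
s = "na\u00efve caf\u00e9"

# What `s[4th]` is meant to return: the 4th element produced by iteration.
fourth_by_splat = (s...,)[4]
# The same Char without building a tuple: look up its byte index first.
fourth_by_index = s[nextind(s, 0, 4)]

# What `s[6th:10th]` is meant to return.
slice_by_splat = join((s...,)[6:10])
slice_by_index = s[nextind(s, 0, 6):nextind(s, 0, 10)]

# Plain Int indexing is a byte offset; byte 4 is the second byte of 'ï'.
int_index_fails = try
    s[4]
    false
catch err
    err isa StringIndexError
end
```

Both spellings agree with each other ('v' and " café"), while s[4] throws, which is the mismatch the ordinal syntax is supposed to hide.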

I would think whoever cares about special string-specific indexing by codepoint, codeunit, grapheme or whatever should be okay with using a library for that.

mnemnion · February 9, 2024, 5:06pm

A codepoint is simply a Char, in every sense that matters.

Languages shouldn’t be optimizing for “dumb users” to write less-buggy code which still has bugs in it. Strings come with essential complexity and should have the tools they need to work with all of that complexity, not just some of it.

You haven’t explained why a Char is the correct “ordinal” way to count a string. In every other proposed use of OrdinalIndexing, indexable[5] and indexable[5th] are the same value, if the indexable type starts with an index of 1, which String does. Reusing that syntax for Strings breaks that contract.

I suspect you’re just reaching for it out of convenience: there’s a proposal on the table for a second syntax for indexing, why not reuse it here. Why not: because it doesn’t fit the application.

Let me put it this way: do you think your dumb user would expect the “fifth” element of “hey!” followed by a multi-codepoint emoji to be the first half of the emoji?

uniment · February 9, 2024, 5:29pm

I think I have:

In other words, I would expect this to work:

for (i,c) = enumerate(my_string)
    @assert c === my_string[(i)th]
end

Iteration is the common thread tying all sorts of collections together, and ordinal indexing can be considered a [reliable, but sometimes inefficient] way to index them.
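A minimal sketch of that contract, assuming a hypothetical Ordinal type and th constant (neither is the OrdinalIndexing package's actual implementation). The generic fallback simply walks the iterator, which is the reliable but sometimes inefficient path.

```julia
# Hypothetical sketch: `Ordinal`, `OrdinalUnit`, `th` and `nth` are illustrative names.
struct Ordinal
    n::Int
end
struct OrdinalUnit end
const th = OrdinalUnit()
Base.:*(n::Integer, ::OrdinalUnit) = Ordinal(n)   # enables `5th` and `(i)th`

# Generic fallback: skip n-1 elements, take the next one. O(n) for any iterable.
nth(itr, o::Ordinal) = first(Iterators.drop(itr, o.n - 1))

# Strings opt in by forwarding ordinal indices to the fallback.
Base.getindex(s::AbstractString, o::Ordinal) = nth(s, o)

my_string = "h\u00e9llo w\u00f6rld"   # contains two-byte characters

# The proposed contract: ordinal indexing agrees with iteration order.
for (i, c) in enumerate(my_string)
    @assert c === my_string[(i)th]
end
```

A dense array could implement the same method as a plain offset computation, so only types without cheap random access pay the linear cost.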

And indexing by both positional and ordinal index can get the best of both worlds when that’s an option.

This is because the OrdinalIndexing package was originally conceived only for dense arrays. I’m not proposing to implement it as-is; I propose a new contract (see for loop above) that makes it actually useful for strings and ordered dictionaries and whatever else, while retaining its behavior for dense arrays.

I would expect my_string[5th] to be (my_string...,)[5], whatever that may be.

If you disagree that (my_string...,)[5] should be whatever it is, then that’s an issue that should be taken up with String’s iteration interface.

mnemnion · February 9, 2024, 5:37pm

uniment:

In other words, I would expect this to work:
for (i,c) = enumerate(my_string)
    @assert c === my_string[(i)th]
end
Iteration is the common thread tying all sorts of collections together, and ordinal indexing can be considered a [reliable, but sometimes inefficient] way to index them.

Is that the actual contract of OrdinalIndexing? Does this apply to sparse arrays as well, for example? If so, ok, this is reasonable. If that isn’t in fact a correct generalization of the idiom, then it shouldn’t be used.

I still think it’s important to have index types which aren’t based on Chars, though. There are scripts like Devanagari, and many like it, where codepoints are practically useless.

uniment · February 9, 2024, 6:36pm

Yes, my idea is strictly a generalization of that package’s contract to work with non-array types, with one change in semantics: whereas 4+5th would previously create a 9th index, under my proposal it would remain 4+5th, a PositionalOrdinalIndex(4,5) representing the combination of a positional index of 4 and an ordinal index of 5. To get just the 9th ordinal index, one would write (4+5)th or 4th+5th.
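The arithmetic described here could look roughly like the following. Only the name PositionalOrdinalIndex comes from the post; its field names, the Ordinal type and the th constant are assumptions made for illustration, and the sketch covers construction only, not lookup.

```julia
# Hypothetical sketch of the index arithmetic only.
struct Ordinal
    n::Int
end
struct OrdinalUnit end
const th = OrdinalUnit()
Base.:*(n::Integer, ::OrdinalUnit) = Ordinal(n)

struct PositionalOrdinalIndex
    position::Int   # positional part, e.g. the 4 in `4+5th`
    ordinal::Int    # ordinal part, e.g. the 5 in `4+5th`
end

# Ordinal + Ordinal stays ordinal.
Base.:+(a::Ordinal, b::Ordinal) = Ordinal(a.n + b.n)
# Integer + Ordinal keeps both parts instead of collapsing them.
Base.:+(p::Integer, o::Ordinal) = PositionalOrdinalIndex(p, o.n)

mixed   = 4 + 5th     # juxtaposition binds tighter than `+`, so this is 4 + (5th)
summed  = (4 + 5)th   # plain integer sum first, then made ordinal
ordsum  = 4th + 5th   # two ordinals added together
```

Because 5th is evaluated before the addition, no special parsing is needed to tell the three spellings apart; dispatch on the argument types does it.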

Ok! I think there’s room for both the ordinal indexing th concept and the string-specific ch, gr, wc ideas, although the latter would be better in a package imo (maybe wrong)

mnemnion · February 9, 2024, 6:38pm

Sounds like a reasonable PR to submit to that package then.

uniment · February 9, 2024, 6:40pm

Sure! More importantly, a language feature to consider.

mnemnion · March 27, 2024, 10:11pm

So I went ahead and implemented this.
