StringIndex idea (Julia 2.0)

Fortunately, Julia already has just such a function, done correctly: the lpad function.

And this is why I love Julia, and I’m more than happy to have switched from Fortran/Python to Julia. It also demonstrates why Julia’s string handling is in good hands, and really doesn’t need to go back to the drawing board. :wink:
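For reference, here is a small sketch of the lpad behavior being praised. It assumes Julia 1.7 or later, where lpad pads to a given textwidth rather than a codepoint count; the sample string is only an illustration.

```julia
# "café" written with a combining accent: 'e' followed by U+0301.
s = "cafe\u0301"

length(s)        # 5 codepoints
textwidth(s)     # 4 terminal columns (the combining mark has width 0)

# Pads to 8 columns, not 8 codepoints (assumes Julia >= 1.7).
padded = lpad(s, 8)

textwidth(padded)   # 8
length(padded)      # 9
```

Padding by codepoint count would have produced a string only 7 columns wide here.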

(Thanks for pointing out the bug in the Python code. You’re right, although it’s probably good enough for practical use)


8 posts were split to a new topic: Truncating padded text

The idea here is to make it easier to write less buggy code, surely. Julia doesn’t have to provide nextind or prevind either, or return a full character when indexed. It could do what Lua does: treat a string as an opaque byte array, and leave encoding matters up to userspace. I’m glad it doesn’t!
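To make that concrete, a minimal sketch of what String indexing currently does in Julia 1.x (the sample string is arbitrary):

```julia
s = "αβγ"            # each of these characters takes 2 bytes in UTF-8

s[1]                 # 'α': a valid byte index returns a whole Char
nextind(s, 1)        # 3: byte index where the next character starts
prevind(s, 5)        # 3: byte index where the previous character starts
codeunit(s, 2)       # 0xb1: the raw second byte is still reachable

# Index 2 falls inside 'α', so indexing there throws rather than
# returning half a character.
err = try
    s[2]
catch e
    e
end
err isa StringIndexError   # true
```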

I’m also glad it doesn’t use UTF-32 just to get O(1) codepoint indexing; I consider all these choices good. I wouldn’t want indexing into a string to do anything else, and I certainly wouldn’t want indexing into a String with an Int to turn into a codepoint operation. One of the core advantages of the view proposal is that it introduces no breaking changes: everything now in the language functions identically.

I think a collection of view types which treat indexing differently would be useful, and a collection of index types to go with them is an essential consequence of having them. It’s a good fit for Julia’s approach to genericity and dispatch, and it would greatly simplify code which works with several aspects of a string simultaneously. It exposes the complexity of strings and Unicode to the user, while also managing that complexity. I’ve written some grapheme-respecting code, and it’s a hassle; it would have been much easier to write correctly if a GraphemeView were available.

Others have also answered this, but yes, truncation and padding are very common operations on strings, and really should be done using graphemes or textwidth (bugs notwithstanding), depending on the medium. Codepoints aren’t as commonly useful, in fact, but it would be odd to omit them on that basis: they’re a valid and important way to handle a string.
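As a rough illustration of the difference, here is a sketch of truncation by graphemes versus by codepoints. The helper name truncate_graphemes is hypothetical; graphemes comes from the Unicode standard library.

```julia
using Unicode  # standard library; provides `graphemes`

# Hypothetical helper: keep at most `n` user-perceived characters.
truncate_graphemes(s::AbstractString, n::Integer) =
    join(Iterators.take(graphemes(s), n))

# "été" written with combining accents: 5 codepoints, 3 graphemes.
s = "e\u0301te\u0301"

first(s, 2)                # "e\u0301": stops after 2 codepoints
truncate_graphemes(s, 2)   # "e\u0301t": stops after 2 graphemes
```

For terminal output, textwidth would be the relevant measure instead of the grapheme count.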

I think all this would be a good addition to the core system, but maybe it should start life as a package to prove out the concept. There are some real unsolved questions about how all this code would work together, all of which would need to be answered before this could become a permanent addition to the Julia 1.x API.

Yes, but that’s the argument for an opaque StringIndex type, à la JuliaLang/julia issue #9297 (“Restrict indexing into strings to a special `ByteIndex` or `StringIndex` type”), as mentioned above.

I don’t think it’s an argument for O(1) character indexing (e.g. via a CharString with cached indices as you were proposing).

We don’t “omit” codepoints — you can certainly access them, and unlike Python we have a whole type (Char) to represent them — we just don’t support O(1) indexing of the n-th codepoint (at least, not with String / UTF-8 … UTF32String in LegacyStrings.jl does this, of course).
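A short sketch of what "not omitted" looks like in practice. The Vector{Char} at the end is just one way to get UTF-32-like O(1) access when it is really wanted, at the cost of 4 bytes per codepoint.

```julia
s = "na\u00efve"            # "naïve"; 'ï' is one codepoint, two bytes in UTF-8

c = s[3]                    # 'ï', a Char
codepoint(c)                # 0x000000ef

ncodeunits(s)               # 6 bytes
length(s)                   # 5 codepoints (counting them is O(n))

# Iterating a String yields one Char per codepoint.
cps = [codepoint(ch) for ch in s]

# Collecting gives a Vector{Char} with O(1) access to the n-th codepoint.
chars = collect(s)
chars[4]                    # 'v'
```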

To be clear, I’m not necessarily opposed to opaque string indices to catch bugs, though these introduce their own complications as noted above, and would be a breaking change (as return values from findnext etc). What I’m arguing against is the insistence that first-class string support requires a type with O(1) codepoint indexing, which so far is still apparently lacking any common practical application, and is not supported in the default string type of any mainstream language other than Python3.

I think the problem here might be that String is indexable at all. I think we would likely see fewer complaints if we instead had nth_char(::AbstractString, ::Integer) and nth_byte(::AbstractString, ::Integer), and getindex(::AbstractString, ::Integer) threw an error with a helpful message.
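A minimal sketch of what those two functions could look like on top of existing Base functions. The names nth_char and nth_byte are the hypothetical ones from the post; neither exists in Base.

```julia
# Hypothetical: the n-th code unit (a byte for String), O(1).
nth_byte(s::AbstractString, n::Integer) = codeunit(s, n)

# Hypothetical: the n-th character, found by an O(n) scan from the start.
# `nextind(s, 0, n)` returns the index where the n-th character begins.
nth_char(s::AbstractString, n::Integer) = s[nextind(s, 0, n)]

s = "αβγ"
nth_char(s, 2)   # 'β'
nth_byte(s, 2)   # 0xb1
```

Spelling out the two operations makes the cost difference visible at the call site, which is arguably the point of the suggestion.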


I would expect a lot of complaints if you were unable to index a string, like you can in every single other programming language I’ve ever used.

The issue is that there’s more than one sensible way to index into a string, and a String type has to support one of those, rather than several.

There’s an obvious analogy here to reshaping a multidimensional Array as a Vector. A GraphemeString (or GraphemeView, not sure which emphasis is better) would reshape a String as graphemes. That’s basically the whole idea. A GraphemeIndex is its dual, an index into (any) string in terms of its graphemes.


I don’t think anyone would object to a StringViews package with CodepointsView, GraphemesView, and TermColumnsView, each with their own indexing and iteration, and each defining v[i] as the appropriate SubString of the String they wrap.
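To give a feel for the idea, here is a rough sketch of one such view. The type name GraphemesView follows the post, but the design (precomputing grapheme start offsets, restricted to String) is only an assumption made for illustration, not a proposal for how a real package should do it.

```julia
using Unicode  # standard library; provides `graphemes`

# Hypothetical view type: wraps a String and records the byte index at
# which each grapheme starts. Construction is O(n); lookup is O(1).
struct GraphemesView
    parent::String
    starts::Vector{Int}
end

function GraphemesView(s::String)
    starts = Int[]
    i = firstindex(s)
    for g in graphemes(s)
        push!(starts, i)
        i += ncodeunits(g)   # String indices are byte offsets
    end
    return GraphemesView(s, starts)
end

Base.length(v::GraphemesView) = length(v.starts)
Base.lastindex(v::GraphemesView) = length(v)

# v[i] is the i-th grapheme, returned as a SubString of the parent.
function Base.getindex(v::GraphemesView, i::Integer)
    lo = v.starts[i]
    hi = i < length(v) ? prevind(v.parent, v.starts[i + 1]) : lastindex(v.parent)
    return SubString(v.parent, lo, hi)
end

v = GraphemesView("e\u0301xy")   # "éxy" with a combining accent
v[1]     # "e\u0301"
v[end]   # "y"
```

Whether the offsets should be cached like this or recomputed on each access is exactly the kind of open design question a package could explore.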

That sounds like a super useful thing!


There is already a StringViews.jl package for a totally different purpose. But someone would definitely be welcome to write a package that provides alternative indexing views of strings, if they feel so inclined.

(So far, no one has pointed to an actual practical use?)


Julia allows you to index into a string, too. It provides n-th codeunit for an arbitrary n in O(1), and allows you to save the location of arbitrary characters and substrings (e.g. during searching/iteration) for later O(1) access.
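A small sketch of that workflow, using only Base functions (the sample string is arbitrary):

```julia
s = "αβγδε"               # 5 characters, 2 bytes each

codeunit(s, 4)            # 0xb2: the 4th code unit, O(1)

# A search returns a byte index that can be stored and reused.
i = findfirst('δ', s)     # 7
s[i]                      # 'δ': O(1) lookup at the saved index

# Substring searches return a range of byte indices.
r = findfirst("γδ", s)    # 5:7
s[r]                      # "γδ"
sub = SubString(s, r)     # the same text as a view, without copying
```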

What it doesn’t provide is O(1) (or similar) access to the n-th codepoint. But neither do C, C++, Go, Swift, Rust, Java, JavaScript, C#, or any other mainstream language other than Python3 AFAIK. Mostly they provide random access into codeunits (either UTF-8 or UTF-16) in their default string types, not Unicode characters. So I’m not sure what programming languages you’re referring to.

I quoted the entire post I was replying to, which is directly above the one you’re objecting to, so Discourse removed the quote. That may have something to do with the severe confusion you’re exhibiting here.

Specifically, there was a proposal (perhaps a joke?) to disallow getindex on strings entirely.

@Oscar_Smith suggested making getindex(::AbstractString, ::Integer) an error, not removing getindex entirely. I suspect that he was referring to the proposal to support getindex only for opaque StringIndex types (which was the genesis of this whole thread). I agree that we should have a getindex of some kind.

And I was saying that, in every programming language I’ve ever worked with, you can index a string with an integer.

Do you know of exceptions?

Swift has String.Index. Rust gives a compile error if you try to index a string directly with an integer (it allows integer ranges for string slices, but very similar to Julia these are byte indices and will cause a runtime error if they don’t fall at character boundaries). I initially thought Kotlin required string.get(n) rather than supporting string[n], but s[n] is apparently a synonym for s.get(n). There are probably other examples, but likely only from relatively new languages where UTF-8 is fully entrenched. (Many other languages support consecutive integer indices which return codeunits rather than codepoints, of course.)


No, I was referring to getting rid of getindex on Strings entirely and replacing it with a function lookup (since there are multiple possible things "hi"[1] could mean).

That seems too breaking even for a hypothetical Julia 2.0.

What’s the problem with "hi"[i] where i is an opaque type returned by, e.g. i = findfirst(somepredicate, "hi")?

Julia is a language where you can do this:

julia> i = 1
1

julia> i[1]
1

Disallowing indexing on Strings would be a very weird thing to do.

The downside of opaque type indexing is that you then have to figure out what the promotion rules are (e.g. can you add 1?).
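A rough sketch of the trade-off. Both StringIndex and findfirst_index are hypothetical names made up for this example; nothing like them exists in Base.

```julia
# Hypothetical opaque index: wraps a byte position the user cannot
# accidentally do arithmetic on.
struct StringIndex
    byte::Int
end

# Hypothetical search function returning the opaque index instead of an Int.
function findfirst_index(pred, s::AbstractString)
    i = findfirst(pred, s)
    return i === nothing ? nothing : StringIndex(i)
end

# Indexing with the opaque type unwraps the stored position.
Base.getindex(s::AbstractString, i::StringIndex) = s[i.byte]

str = "h\u00e9Llo"                       # "héLlo"
i = findfirst_index(isuppercase, str)
str[i]                                   # 'L'

# No `+` is defined, so `i + 1` fails until someone decides what it means
# (next byte? next character? next grapheme?).
add_fails = try
    i + 1
    false
catch e
    e isa MethodError
end
```

One possible answer is to offer only explicit steps such as nextind(str, i.byte) and leave + undefined, but that is a design decision rather than something the type forces.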

IMO, the fact that Number sometimes behaves like a 0-dimensional collection was a pretty big mistake that was kept only because, by 0.7, too many things used it.


See the beginning of this thread and the linked issue for discussion of precisely that.

This idea can be made more ergonomic by merging it with another idea discussed previously:

Something similar to @jishnub’s OrdinalIndexing could be added to the language—this is almost identical to this thread’s OP, but with a nicer user interface that can generalize to other indexable types. The purpose of OrdinalIndexing is to guarantee 1-based indexing even for OffsetArrays. For example:

my_array = OffsetArray(1:10, -10)
my_array[4th] == 4

Applying this concept here, per-character string indexing could go like:

my_string[6th:10th]

(This is reminiscent of my_string.chars().nth(i) in Rust.)

The 4th object (let’s call it OrdinalIndex(4)) would store a single integer representing an ordinal index instead of a positional index. Arithmetic and ranges of OrdinalIndex would work as intuitively expected.

When an OrdinalIndex is added to an Integer, it’s promoted to a PositionOrdinalIndex, which could contain both position_offset and ordinal_index fields like the OP of this thread. For example, 0+6th would construct a PositionOrdinalIndex(0,6), and 6:7th could be PositionOrdinalIndex(6,0):PositionOrdinalIndex(0,7) (for 6th to 7th, use (6:7)th or 6th:7th). Adding PositionOrdinalIndex objects would add both fields, similar to a Complex.

Then, indexing from the end should work like this:

my_string[end-9th:end-5th]

It would also be really awesome for ordered dictionaries, wherein it could be imposed that ::OrdinalIndex and ::PositionOrdinalIndex hashkeys are not permitted, so that the dictionary entries can be accessed in insertion order by ::OrdinalIndex and ::PositionOrdinalIndex-typed indices:

my_ordered_dict[1th:100th] # first 100 entries

Maybe it could also be interesting for accessing sparse matrices—I’m not sure.

I’m agnostic on names btw, and as @StefanKarpinski ponders above it might not be worthwhile to maintain an OrdinalIndex type separate from PositionOrdinalIndex, since the position will often be const-propped anyway: using only the latter would make the implementation of this idea analogous to the representation of complex numbers (wherein all imaginary numbers are represented as Complex even with zero real component; similar to im ≡ Complex(false,true) we would define const th = PositionOrdinalIndex(false,true)).
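Here is a very rough sketch of the single-type variant, just to show that the arithmetic hangs together. Everything in it is an assumption for illustration: the field names, the use of Int fields rather than the Bool-based constant above, and the choice to interpret the positional part as a code unit offset when indexing a string. Ranges such as 6th:10th and end-relative indexing are left out.

```julia
# Hypothetical type: a positional offset plus an ordinal (character count),
# combined like the real and imaginary parts of a Complex.
struct PositionOrdinalIndex
    position::Int
    ordinal::Int
end

const th = PositionOrdinalIndex(0, 1)

# `6th` parses as `6 * th`, so integer scaling builds the ordinal.
Base.:*(n::Integer, i::PositionOrdinalIndex) =
    PositionOrdinalIndex(n * i.position, n * i.ordinal)

# Adding two indices adds both fields; adding an Integer shifts the position.
Base.:+(a::PositionOrdinalIndex, b::PositionOrdinalIndex) =
    PositionOrdinalIndex(a.position + b.position, a.ordinal + b.ordinal)
Base.:+(n::Integer, i::PositionOrdinalIndex) =
    PositionOrdinalIndex(n + i.position, i.ordinal)
Base.:+(i::PositionOrdinalIndex, n::Integer) = n + i

# Assumed string semantics: walk `ordinal` characters from the start,
# then apply the positional offset in code units.
Base.getindex(s::AbstractString, i::PositionOrdinalIndex) =
    s[nextind(s, 0, i.ordinal) + i.position]

idx = 0 + 6th        # PositionOrdinalIndex(0, 6)
ch = "αβγδ"[3th]     # 'γ'
```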
