SubString doesn't work with unicode

It seems SubString works well with ASCII characters, but not with non-ASCII Unicode characters. Below is an example in Julia 1.4.2:

julia> SubString("aaaa",1,2)
"aa"

julia> SubString("αααα",1,2)
ERROR: StringIndexError("αααα", 2)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at .\strings\string.jl:12
 [2] SubString at .\strings\substring.jl:32 [inlined]
 [3] SubString(::String, ::Int64, ::Int64) at .\strings\substring.jl:38
 [4] top-level scope at REPL[79]:1

Julia indexes into strings by code unit, not by character. Take a look at the relevant docs section for how to correctly handle Unicode.
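As a minimal sketch of what this means for the example above (using only functions from Base, and assuming the default UTF-8 `String` type), the valid indices of "αααα" are not 1, 2, 3, 4:

```julia
s = "αααα"

ncodeunits(s)            # 8: each α takes two bytes in UTF-8
length(s)                # 4 characters
collect(eachindex(s))    # [1, 3, 5, 7]: the valid string indices

isvalid(s, 2)            # false, which is why SubString(s, 1, 2) throws
j = nextind(s, 1)        # 3, the index where the second character starts
sub = SubString(s, 1, j) # "αα"
head = first(s, 2)       # "αα": first(s, n) counts characters, not bytes
```

The end index passed to `SubString` is the index at which the last wanted character starts, not the position of its final byte.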

Thanks a lot. I didn’t think of that since variable names with Unicode worked for me thus far. But seeing how delicate character encoding can be, it makes complete sense to have a sound basis within Julia to incorporate Unicode into strings.

Hello,

I regularly stumble into this. There seem to be many string-handling packages, but I didn’t quickly find one that simply provided indexing by Unicode code point.

For now, a quick hack could be (no bounds checking)

function substring(s::AbstractString, i::Int, j::Int)
  chars = collect(s)        # Vector{Char}: one element per code point, O(n) copy
  return join(chars[i : j]) # slice by character position and rebuild a String
end

but this isn’t very efficient for repeated operations on the same string.
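A possible variant that avoids the intermediate array is sketched below. It assumes the three-argument `nextind(s, i, n)` from Base to translate character positions into byte indices; the name `charsubstring` is made up for this example and is not part of Base. It still walks the string from the start on every call, so it is O(n) per call, but it returns a `SubString` view instead of copying.

```julia
# Hypothetical helper, not part of Base: substring by character position.
function charsubstring(s::AbstractString, i::Integer, j::Integer)
    a = nextind(s, 0, i)   # byte index where the i-th character starts
    b = nextind(s, 0, j)   # byte index where the j-th character starts
    return SubString(s, a, b)   # throws BoundsError if j exceeds length(s)
end

r1 = charsubstring("αααα", 1, 2)    # "αα"
r2 = charsubstring("aą∀∃ę", 2, 4)   # "ą∀∃"
```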

Is this in the Julia Forem?

Thanks, the macro looks impressive!

However, I can’t imagine this is the kind of thing a novice Julia user is supposed to conjure up if she wants to take a substring by character index rather than by byte index in the UTF-8 encoding.

Are there String implementations that simply work with 32-bit Unicode characters in memory? It might not always be that memory efficient, but my guess is that there are lots of applications where memory is less important and ease of substring manipulation is more important.

Long ago I thought of a string type with an additional index collect(eachindex(s)), but the additional memory needed for that might as well be spent directly on the wider character representations.
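To make that idea concrete, here is a rough sketch of such a type. The name `IndexedString` and its fields are invented for illustration and do not come from any package; it stores the original string plus the byte index of every character, so lookup by character position becomes O(1) at the cost of one `Int` per character.

```julia
# Hypothetical type for illustration only; not from Base or any package.
struct IndexedString{S<:AbstractString}
    str::S
    idx::Vector{Int}   # idx[k] is the byte index where the k-th character starts
end

IndexedString(s::AbstractString) = IndexedString(s, collect(eachindex(s)))

Base.length(x::IndexedString) = length(x.idx)
Base.getindex(x::IndexedString, k::Integer) = x.str[x.idx[k]]
Base.getindex(x::IndexedString, r::UnitRange{<:Integer}) =
    isempty(r) ? SubString(x.str, 1, 0) :
                 SubString(x.str, x.idx[first(r)], x.idx[last(r)])

x = IndexedString("aą∀∃ę")
x[3]      # '∀'
x[2:4]    # "ą∀∃"
```

On a 64-bit machine the index costs 8 bytes per character, which is indeed more than the 4 bytes per character of a UTF-32 representation.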

This blog post is meant to provide exactly the opposite. I do not think any Julia beginner is able to write such code. I made it public so that any Julia beginner can use it (even without understanding the details of it).

For UTF-32 encoded strings you can use LegacyStrings.jl:

julia> using LegacyStrings

julia> x = convert(UTF32String, "aą∀∃ę")
"aą∀∃ę"

julia> x[1]
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> x[2]
'ą': Unicode U+0105 (category Ll: Letter, lowercase)

julia> x[3]
'∀': Unicode U+2200 (category Sm: Symbol, math)

julia> x[4]
'∃': Unicode U+2203 (category Sm: Symbol, math)

julia> x[5]
'ę': Unicode U+0119 (category Ll: Letter, lowercase)
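If an extra package is not wanted, a similar effect can be sketched with Base alone, assuming a one-off O(n) conversion is acceptable: `collect` turns a string into a `Vector{Char}`, where every element is 4 bytes wide and indexing is by character position.

```julia
s = "aą∀∃ę"
chars = collect(s)          # Vector{Char}, one element per code point

c3 = chars[3]               # '∀'
mid = join(chars[2:4])      # "ą∀∃"
sizes = (sizeof(s), sizeof(chars))   # (11, 20): UTF-8 bytes vs. 4 bytes per Char
```

The result is an array rather than an `AbstractString`, so string functions need a `join` (or similar) to get back to a string.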

@StefanKarpinski - do you know what the current level of maintenance guarantees is for LegacyStrings.jl?

@davidavdav - see also “Consider adding kwargs to chop” (Issue #37397, JuliaLang/julia on GitHub), as this suggestion was made to the Julia Core team to improve the situation you describe. Maybe you would have some thoughts to share there. Thank you!

I hope I am not mistaken, but there is this PR by @stevengj that was supposed to address a related question via:

using Unicode
graphemes("αααα", 1:2)

However, this doesn’t seem to work in Julia 1.7.3.

From the linked PR, how can we know where/when it is ready to be used? Thank you.

Look in the NEWS.md patch in that PR—it’s slated for Julia 1.9.
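Until then, something along these lines should work on older versions too. This is only a sketch: `grapheme_slice` is a made-up name, and the fallback branch assumes the one-argument `graphemes(s)` iterator from the Unicode standard library, combined with `Iterators.drop` and `Iterators.take`.

```julia
using Unicode

# Hypothetical helper: the graphemes of s at positions r, on any Julia 1.x.
function grapheme_slice(s::AbstractString, r::UnitRange{Int})
    if VERSION >= v"1.9"
        return graphemes(s, r)   # two-argument method, returns a SubString
    else
        # Fallback: skip the first(r) - 1 leading graphemes, keep length(r) of them.
        gs = Iterators.take(Iterators.drop(graphemes(s), first(r) - 1), length(r))
        return join(gs)          # returns a new String
    end
end

g = grapheme_slice("αααα", 1:2)   # "αα"
```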

See also this discussion and my answers therein: Substring function?

Thanks for the advice. I went for a look and it wasn’t obvious where to find it: I had to go to the “Files changed” tab and then search for “1.9” to finally see on line 177:

Is this the right way?

That’s fine, since many PRs will have a “compatibility” section like this in the docstring.

You can also go to “Files changed” and click on “View file” in the “…” for NEWS.md:


Scrolling to the top, you’ll see it is the NEWS for Julia 1.9.

As I commented in the other thread, anyone seeking to index strings this way is probably making a mistake, because Unicode is too complicated for character indexing to be useful.
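One illustration of the kind of complication involved (an illustrative example, not taken from the other thread): a single user-perceived character can consist of several code points, so slicing by code-point position can split it.

```julia
using Unicode

s = "e\u0301"                   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
ncp = length(s)                 # 2 code points
ngr = length(graphemes(s))      # 1 grapheme (displayed as a single é)
cut = first(s, 1)               # "e": the accent has been dropped
```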
