SubString doesn't work with unicode

nicolas · June 8, 2020, 8:16am

It seems SubString works well with ASCII characters, but not with Unicode characters. Below is an example in Julia 1.4.2:

julia> SubString("aaaa",1,2)
"aa"

julia> SubString("αααα",1,2)
ERROR: StringIndexError("αααα", 2)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at .\strings\string.jl:12
 [2] SubString at .\strings\substring.jl:32 [inlined]
 [3] SubString(::String, ::Int64, ::Int64) at .\strings\substring.jl:38
 [4] top-level scope at REPL[79]:1

pfitzseb · June 8, 2020, 8:55am

Julia indexes into strings by code unit, not by character. Take a look at the relevant docs section for how to correctly handle Unicode.
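To illustrate the point about code units, here is a minimal sketch using only functions from Base. The variable names are just for illustration. In UTF-8 each α takes two code units, so index 2 falls in the middle of a character, which is why the call above throws.

```julia
s = "αααα"

ncodeunits(s)           # 8 code units (bytes)
length(s)               # 4 characters (code points)
collect(eachindex(s))   # [1, 3, 5, 7]: the valid character start indices

# Character-based helpers avoid computing byte indices by hand.
first(s, 2)                     # first two characters
chop(s, head = 1, tail = 1)     # drop one character from each end

# nextind(s, 0, n) returns the code-unit index of the n-th character.
i = nextind(s, 0, 1)
j = nextind(s, 0, 2)
SubString(s, i, j)              # the first two characters, without copying
```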

nicolas · June 9, 2020, 8:10am

Thanks a lot. I didn’t think of that since variable names with unicode worked for me thus far. But seeing how delicate character encoding can be, it makes complete sense to have a sound basis within Julia to incorporate unicode into strings.

davidavdav · June 13, 2022, 4:13pm

Hello,

I regularly stumble into this. There seem to be many string-support packages, but I didn’t quickly find one that simply provides indexing by Unicode code point.

For now, a quick hack could be (no bounds checking)

function substring(s::AbstractString, i::Int, j::Int)
  # collect(s) yields one Char per code point, so i and j count characters, not bytes
  chars = collect(s)
  return join(chars[i:j])
end

but this isn’t very efficient for repeated operations on the same string.
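A variant that avoids the intermediate array might look like the sketch below. The name `charsubstring` is hypothetical. It still scans the string from the start on every call, so it is O(n) in time, but it allocates nothing and returns a `SubString` view. It assumes a non-empty range `i:j`.

```julia
# Hypothetical helper: substring by character positions i..j (1-based, inclusive).
function charsubstring(s::AbstractString, i::Integer, j::Integer)
    1 <= i <= j <= length(s) || throw(BoundsError(s, i:j))
    a = nextind(s, 0, i)      # code-unit index of the i-th character
    b = nextind(s, a, j - i)  # advance j - i more characters from there
    return SubString(s, a, b)
end

charsubstring("αααα", 1, 2)
```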

bkamins · June 13, 2022, 4:17pm

StevenSiew · June 13, 2022, 11:16pm

Is this in the Julia Forem?

davidavdav · June 17, 2022, 7:37am

Thanks, the macro looks impressive!

However I can’t imagine this is the kind of thing a novice Julia user is supposed to conjure up if she wants to substring based on character index rather than byte-in-utf8-encoding index.

Are there String implementations that simply work with 32bit Unicode characters in memory? It might not always be that memory efficient, but my guess is that there are lots of applications where memory is less important and ease of substring manipulation is more important.

Long ago I thought of a string type with an additional index collect(eachindex(s)), but the additional memory needed for that might as well be spent directly on the wider character representations.
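A rough sketch of that idea, assuming a hypothetical `IndexedString` type that is not part of any package: it stores the string together with `collect(eachindex(s))`, so character lookups become O(1). On a 64-bit machine the index costs 8 bytes per character, which supports the remark about memory above.

```julia
# Hypothetical wrapper: a String plus the code-unit index of every character.
struct IndexedString
    str::String
    idx::Vector{Int}   # idx[k] is the code-unit index where character k starts
end

function IndexedString(s::AbstractString)
    str = String(s)
    return IndexedString(str, collect(eachindex(str)))
end

Base.length(s::IndexedString) = length(s.idx)

# k-th character
Base.getindex(s::IndexedString, k::Integer) = s.str[s.idx[k]]

# Characters first(r) through last(r), returned as a view into the original string
function Base.getindex(s::IndexedString, r::UnitRange{<:Integer})
    isempty(r) && return SubString(s.str, 1, 0)
    return SubString(s.str, s.idx[first(r)], s.idx[last(r)])
end

x = IndexedString("aą∀∃ę")
x[2:4]
```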

bkamins · June 17, 2022, 8:05am

This blog post is meant to provide exactly the opposite. I do not think any Julia beginner is able to write such code. I made it public so that any Julia beginner can use it (even without understanding the details of it).

For UTF32 encoded strings you can use LegacyStrings.jl:

julia> using LegacyStrings

julia> x = convert(UTF32String, "aą∀∃ę")
"aą∀∃ę"

julia> x[1]
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> x[2]
'ą': Unicode U+0105 (category Ll: Letter, lowercase)

julia> x[3]
'∀': Unicode U+2200 (category Sm: Symbol, math)

julia> x[4]
'∃': Unicode U+2203 (category Sm: Symbol, math)

julia> x[5]
'ę': Unicode U+0119 (category Ll: Letter, lowercase)

@StefanKarpinski - do you know what the current level of maintenance guarantees for LegacyStrings.jl is?

bkamins · June 17, 2022, 8:18am

@davidavdav - see also “Consider adding kwargs to chop” (Issue #37397, JuliaLang/julia on GitHub), as this suggestion was made to the Julia Core team to improve the situation you describe. Maybe you would have some thoughts to share there. Thank you!

rafael.guerra · June 17, 2022, 7:43pm

I hope I am not mistaken, but there is this PR by @stevengj that was supposed to address a related question via:

using Unicode
graphemes("αααα", 1:2)

However, this doesn’t seem to work in Julia 1.7.3.

From the linked PR, how can we know where/when it is ready to be used? Thank you.

stevengj · June 17, 2022, 9:02pm

Look in the NEWS.md patch in that PR—it’s slated for Julia 1.9.

See also this discussion and my answers therein: Substring function?
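For anyone who needs this before upgrading, a possible stopgap is sketched below. The helper name `grapheme_range` is made up for this example. It calls the ranged `graphemes` method on Julia 1.9 and later, and otherwise falls back to collecting all graphemes, which allocates.

```julia
using Unicode

# Hypothetical helper: the graphemes of `s` at positions `r`, as a string.
function grapheme_range(s::AbstractString, r::UnitRange{Int})
    if VERSION >= v"1.9"
        return graphemes(s, r)                 # ranged method, returns a SubString
    else
        return join(collect(graphemes(s))[r])  # allocating fallback for older versions
    end
end

grapheme_range("αααα", 1:2)
```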

rafael.guerra · June 17, 2022, 9:20pm

Thanks for the advice. I went for a look and it wasn’t next door: I had to go to the “Files changed” tab and then search for “1.9” to finally see on line 177:

Is this the right way?

stevengj · June 17, 2022, 10:02pm

That’s fine, since many PRs will have a “compatibility” section like this in the docstring.

You can also go to “Files changed” and click on “View file” in the “…” for NEWS.md:

Scrolling to the top, you’ll see it is the NEWS for Julia 1.9.

stevengj · June 17, 2022, 10:07pm

As I commented in the other thread, anyone seeking to index strings this way is probably making a mistake, because Unicode is too complicated for character indexing to be useful.
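A small example of the kind of complication meant here, using only Base and the Unicode standard library: a single displayed character can consist of several code points, so cutting at a code-point index can split it.

```julia
using Unicode

s = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT, displayed as é

length(s)                    # 2 code points
length(graphemes(s))         # 1 user-perceived character
first(s, 1)                  # "e": the accent has been cut off

# Normalizing to NFC composes the pair into the single code point U+00E9.
t = Unicode.normalize(s, :NFC)
length(t)                    # 1
```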
