Regular expressions returning offsets in bytes not characters

cstook · July 6, 2017, 3:26am

Hello,

Regular expressions seem to be returning offsets in bytes, not characters. Is this the intended behavior? Is there a way to get the offsets in characters?

julia> m1 = match(r"(3).*(5)"ix,"123a56789")
RegexMatch("3a5", 1="3", 2="5")

julia> print(m1.offsets)
[3,5]

julia> m2 = match(r"(3).*(5)"ix,"123α56789")
RegexMatch("3α5", 1="3", 2="5")

julia> print(m2.offsets)
[3,6]

Thanks

ScottPJones · July 6, 2017, 3:35am

That’s just the way it was designed.
If you want offsets of the Unicode codepoints, you could use the LegacyStrings package, and use the UTF32String type.

cstook · July 6, 2017, 4:55am

OK. Thanks.

johann.spies · July 6, 2017, 6:47am

Or maybe (I am not sure if this is what you want):

julia> m1 = match(r"(?<first>3).*(?<second>5)"ix,"123a56789")
RegexMatch("3a5", first="3", second="5")

julia> m1.offsets
2-element Array{Int64,1}:
 3
 5

julia> m1[:first]
"3"

julia> m1[:second]
"5"

Regards
Johann

nalimilan · July 6, 2017, 7:48am

You don’t need to use UTF32String nor LegacyStrings. Maybe you can tell us more about what you want to do?

String indices are in bytes in Julia (at least for the default String type) because that’s the only efficient way of accessing a character in variable length encodings like UTF-8 or UTF-16. Counting characters requires iterating over the string from its beginning.
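A small sketch of the byte-versus-character distinction, using the string from the original question. It assumes Julia 1.0 or later, where `ncodeunits` and `eachindex` behave as shown; the variable names are only illustrative.

```julia
s = "123α56789"

n_bytes = ncodeunits(s)   # number of UTF-8 code units (bytes)
n_chars = length(s)       # number of characters (code points); walks the string

# Valid indices are the byte positions where a character starts.
# 'α' takes two bytes (4 and 5), so index 5 is skipped.
valid = collect(eachindex(s))

println(n_bytes)          # 10
println(n_chars)          # 9
println(valid)            # [1, 2, 3, 4, 6, 7, 8, 9, 10]
println(isvalid(s, 5))    # false: byte 5 is in the middle of 'α'
```

This is why the second capture in the question is reported at offset 6 rather than 5.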

If what you really need is the number of characters before the first match, you can just do something like length(s[1:m1.offsets[1]]) or (a bit more efficient) length(SubString(s, 1, m1.offsets[1])). But beware that “character” is a subtle notion, which does not necessarily correspond to Unicode codepoints. See graphemes if what you need is the user-perceived number of characters.

waldyrious · July 6, 2017, 9:30am

Docs link for convenience: https://docs.julialang.org/en/stable/stdlib/strings/#Base.UTF8proc.graphemes

stevengj · July 6, 2017, 12:37pm

You can get the offset in characters from the ind2chr function.

The reason that they return the offsets in bytes (“code units” of the underlying UTF-8 encoding) is that this is how a Julia String is indexed, so byte offsets are usually the most useful thing to know (e.g. to extract substrings from the original string).

julia> s = "123α56789"
"123α56789"

julia> m2 = match(r"(3).*(5)"ix, s)
RegexMatch("3α5", 1="3", 2="5")

julia> m2.offsets
2-element Array{Int64,1}:
 3
 6

julia> s[m2.offsets]
"35"

julia> ind2chr.(s, m2.offsets)
2-element Array{Int64,1}:
 3
 5
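
For readers on Julia 1.0 or later: `ind2chr` was removed there, and `length(s, i, j)` (the number of characters between byte indices `i` and `j`) covers the same need. A minimal sketch under that assumption, reusing the example above:

```julia
s = "123α56789"
m2 = match(r"(3).*(5)"ix, s)

byte_offsets = m2.offsets                                # byte indices of the two captures
char_offsets = [length(s, 1, i) for i in byte_offsets]   # character positions

println(byte_offsets)   # [3, 6]
println(char_offsets)   # [3, 5]

# Byte offsets can be used directly to slice the original string.
println(s[byte_offsets[1]:byte_offsets[2]])   # 3α5
```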

stevengj · July 6, 2017, 12:41pm

Don’t use length(s[1:m1.offsets[1]]) for this. Use ind2chr.

But you’re right that you may want to use graphemes, e.g. length(graphemes(SubString(s, 1, m2.offsets[2]))), if you want to count user-perceived characters.
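
A sketch of the grapheme count, assuming Julia 1.0 or later (where `graphemes` lives in the `Unicode` standard library and `length(s, 1, i)` stands in for `ind2chr`). The test string, an `e` followed by a combining acute accent, is made up for illustration:

```julia
using Unicode

s = "e\u0301x"          # 'e' + combining acute accent + 'x'
m = match(r"x", s)

byte_offset     = m.offset                                      # byte index of the match
char_offset     = length(s, 1, m.offset)                        # code points up to the match
grapheme_offset = length(graphemes(SubString(s, 1, m.offset)))  # user-perceived characters

println(byte_offset)      # 4
println(char_offset)      # 3
println(grapheme_offset)  # 2
```

The three numbers differ because the accent is a separate two-byte code point that is displayed as part of the preceding `e`.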

cstook · July 6, 2017, 10:27pm

I was using a mixture of length and match.offsets to compute indices into strings. This was very bad. Switching to using only match.offsets fixed the problem.

Thanks, everyone, for your help.

stevengj · July 7, 2017, 1:19pm

Never use length(s) for indexing strings; use endof(s) instead.
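
To illustrate with a made-up string, and assuming Julia 1.0 or later, where `endof` has been renamed `lastindex`:

```julia
s = "αβγ"                # three characters, two bytes each

n_chars  = length(s)     # number of characters
last_idx = lastindex(s)  # byte index where the last character starts

println(n_chars)         # 3
println(last_idx)        # 5
println(s[last_idx])     # γ
println(s[end])          # γ (`end` is lowered to lastindex(s))
println(s[n_chars])      # β: index 3 happens to be valid, but it is not the last character
println(isvalid(s, 2))   # false: s[2] would throw, since byte 2 is inside 'α'
```

Indexing with a character count either picks the wrong character or lands in the middle of one, depending on the string.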
