Regular expressions returning offsets in bytes not characters

question
regex

#1

Hello,

Regular expressions seem to be returning offsets in bytes, not characters. Is this the intended behavior? Is there a way to get the offsets in characters?

julia> m1 = match(r"(3).*(5)"ix,"123a56789")
RegexMatch("3a5", 1="3", 2="5")

julia> print(m1.offsets)
[3,5]

julia> m2 = match(r"(3).*(5)"ix,"123α56789")
RegexMatch("3α5", 1="3", 2="5")

julia> print(m2.offsets)
[3,6]

Thanks


#2

That’s just the way it was designed.
If you want offsets of the Unicode codepoints, you could use the LegacyStrings package, and use the UTF32String type.


#3

OK. Thanks.


#4

Or maybe (I am not sure if this is what you want):

julia> m1 = match(r"(?<first>3).*(?<second>5)"ix,"123a56789")
RegexMatch("3a5", first="3", second="5")

julia> m1.offsets
2-element Array{Int64,1}:
 3
 5

julia> m1[:first]
"3"

julia> m1[:second]
"5"

Regards
Johann


#5

You don’t need to use UTF32String nor LegacyStrings. Maybe you can tell us more about what you want to do?

String indices are in bytes in Julia (at least for the default String type) because that’s the only efficient way of accessing a character in variable length encodings like UTF-8 or UTF-16. Counting characters requires iterating over the string from its beginning.
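
To make the byte indexing concrete, here is a small sketch using the string from the original question. It assumes Julia 1.x, where ncodeunits, isvalid and nextind are available in Base; the variable names are only for illustration.

```julia
s = "123α56789"

# 9 characters, but 'α' takes 2 bytes in UTF-8, so there are 10 code units.
n_chars = length(s)
n_bytes = ncodeunits(s)

# Index 4 is the start of 'α'; index 5 falls inside it and is not a valid index.
alpha = s[4]
valid4 = isvalid(s, 4)
valid5 = isvalid(s, 5)

# nextind steps to the next valid index, skipping the second byte of 'α'.
after_alpha = nextind(s, 4)

println((n_chars, n_bytes, alpha, valid4, valid5, after_alpha))
```

This is why the second capture in the question is reported at offset 6 rather than 5: the "5" really does start at byte 6.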

If what you really need is the number of characters before the first match, you can just do something like length(s[1:m1.offsets[1]]) or (a bit more efficient) length(SubString(s, 1, m1.offsets[1])). But beware that “character” is a subtle notion, which does not necessarily correspond to Unicode codepoints. See graphemes if what you need is the user-perceived number of characters.
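
A minimal sketch of the two expressions above, applied to the string from the question (assuming Julia 1.x; here the second capture group is used so that the multi-byte character is part of the prefix):

```julia
s = "123α56789"
m = match(r"(3).*(5)"ix, s)

i = m.offsets[2]   # byte index of the second capture, "5"

# Both forms count the characters up to and including byte index i.
n_copy = length(s[1:i])              # s[1:i] allocates a new String
n_view = length(SubString(s, 1, i))  # SubString is a view into s, no copy

println((i, n_copy, n_view))
```

Both counts agree; the SubString version just avoids copying the prefix.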


#6

Docs link for convenience: https://docs.julialang.org/en/stable/stdlib/strings/#Base.UTF8proc.graphemes


#7

You can get the offset in characters from the ind2chr function.

The offsets are returned in bytes (“code units” of the underlying UTF-8 encoding) because that is how a Julia String is indexed, so byte offsets are usually the most useful thing to know (e.g. to extract substrings from the original string).

julia> s = "123α56789"
"123α56789"

julia> m2 = match(r"(3).*(5)"ix, s)
RegexMatch("3α5", 1="3", 2="5")

julia> m2.offsets
2-element Array{Int64,1}:
 3
 6

julia> s[m2.offsets]
"35"

julia> ind2chr.(s, m2.offsets)
2-element Array{Int64,1}:
 3
 5
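
Note for readers on newer versions: ind2chr exists in the Julia version used in this thread, but it was removed in Julia 1.0, where the three-argument length(s, i, j) does the same job. A sketch of the same conversion, assuming Julia 1.x:

```julia
s = "123α56789"
m2 = match(r"(3).*(5)"ix, s)

# length(s, 1, i) counts the characters from index 1 through byte index i,
# which is what ind2chr(s, i) returned in older versions.
char_offsets = [length(s, 1, i) for i in m2.offsets]

println(m2.offsets)    # byte offsets
println(char_offsets)  # character offsets
```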

#8

Don’t do this (i.e. length(s[1:m1.offsets[1]])). Use ind2chr.

But you’re right that you may want to use graphemes, e.g. length(graphemes(SubString(s, 1, m2.offsets[2]))), if you want to count user-perceived characters.
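
To illustrate the difference between code points and user-perceived characters, here is a sketch with a made-up string containing a combining accent. It assumes Julia 1.x, where graphemes lives in the Unicode standard library; the string and variable names are hypothetical, not from the original question.

```julia
using Unicode  # graphemes is provided by the Unicode standard library in Julia 1.x

# "e" followed by a combining acute accent: two code points, one perceived character.
s = "cafe\u0301 5"
m = match(r"(5)", s)
i = m.offsets[1]   # byte index of the capture

prefix = SubString(s, 1, i)
n_codepoints = length(prefix)
n_graphemes = length(graphemes(prefix))

println((i, n_codepoints, n_graphemes))
```

The three numbers differ: the byte offset, the code point count and the grapheme count each answer a different question.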


#9

I was using a mixture of length and match.offsets to compute indices into strings. This was very bad. Switching to using only match.offsets fixed the problem.
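
For anyone hitting the same problem, a sketch of what goes wrong when the two are mixed (assuming Julia 1.x, with nextind and prevind used to move between valid indices):

```julia
s = "123α56789"
m = match(r"(3).*(5)"ix, s)

first_idx, second_idx = m.offsets

# Byte offsets from the match can be used directly to index into s.
between = s[nextind(s, first_idx):prevind(s, second_idx)]

# A character count is not an index: the second capture is the 5th character,
# but byte index 5 is the middle of 'α'.
char_pos = length(s, 1, second_idx)
usable_as_index = isvalid(s, char_pos)

println((between, char_pos, usable_as_index))
```

Indexing with s[char_pos] here would throw an error, and in other strings it can silently return the wrong character.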

Thanks, everyone, for your help.


#10

Never use length(s) for indexing strings; use endof(s) instead.
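
A short sketch of why, assuming Julia 1.x, where endof has been renamed to lastindex:

```julia
s = "123α56789"

n = length(s)          # number of characters
last_i = lastindex(s)  # byte index of the last character (endof in older versions)

wrong = s[n]       # a valid index here, but not the last character
right = s[last_i]  # same as s[end]

# eachindex yields only the valid indices; note the gap where 'α' spans two bytes.
valid = collect(eachindex(s))

println((n, last_i, wrong, right))
println(valid)
```

In this string s[length(s)] does not even fail; it quietly returns the second-to-last character, which makes the bug easy to miss.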