Regex on byte vector

kernelmethod · October 5, 2020, 12:19am

Hey there! In Python, I can apply a regular expression to a byte string using something like

import re
patt = re.compile(b"^ab+c$")
patt.match(b"abbbc")

Is there a way of doing something similar in Julia? I know I can do something like

my_occursin!(patt::Regex, x::Vector{UInt8}) = occursin(patt, String(x))

but this isn’t quite the same because String(x) mutates x:

julia> x = Vector{UInt8}("abc")
3-element Array{UInt8,1}:
 0x61
 0x62
 0x63

julia> String(x)
"abc"

julia> x
UInt8[]

Is there a way to apply a regex to a Vector{UInt8} that doesn’t require mutating the byte vector?

Thanks!

lmiq · October 5, 2020, 12:37am

I don’t know if there is a better solution, but in the worst case, you can copy x:

my_occursin!(patt::Regex, x::Vector{UInt8}) = occursin(patt, String(copy(x)))

Actually you can use a view:

julia> x = Vector{UInt8}("abbbc")
5-element Array{UInt8,1}:
 0x61
 0x62
 0x62
 0x62
 0x63

julia> occursin(Regex("^ab..c"),String(@view x[1:end]))
true

julia> x
5-element Array{UInt8,1}:
 0x61
 0x62
 0x62
 0x62
 0x63

kernelmethod · October 5, 2020, 12:47am

The view-based idea is good, but I think that under the hood it’s also just creating a copy of x: https://github.com/JuliaLang/julia/blob/55a6dab76329b693f0fab372b1a80289bff01a90/base/strings/string.jl#L51
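A small check is consistent with this reading of the linked source: for an input that is not a Vector{UInt8}, such as a view, String copies the bytes, so the parent array keeps its contents and later writes to it do not show up in the string. A minimal sketch (the variable names are only for illustration):

```julia
x = Vector{UInt8}("abbbc")

# A SubArray is an AbstractVector{UInt8} but not a Vector{UInt8},
# so String copies its bytes rather than taking over x's memory.
s = String(@view x[1:end])

x[1] = 0x7a   # overwrite the first byte of the parent ('a' -> 'z')

s             # still "abbbc": the string holds its own copy
length(x)     # still 5: the parent was not truncated
```

So the view avoids the truncation, but it does not avoid the copy.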

lmiq · October 5, 2020, 12:49am

Indeed, the help entry for String is explicit in suggesting the copy, with some reasons given.

help?> String
search: String string StringIndexError Cstring Cwstring bitstring SubString include_string setrounding unsafe_string AbstractString

  String(v::AbstractVector{UInt8})

  Create a new String object from a byte vector v containing UTF-8 encoded characters. If v is Vector{UInt8} it will be truncated to
  zero length and future modification of v cannot affect the contents of the resulting string. To avoid truncation use
  String(copy(v)).

  When possible, the memory of v will be used without copying when the String object is created. This is guaranteed to be the case
  for byte vectors returned by take! on a writable IOBuffer and by calls to read(io, nb). This allows zero-copy conversion of I/O
  data to strings. In other cases, Vector{UInt8} data may be copied, but v is truncated anyway to guarantee consistent behavior.

  ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

  String(s::AbstractString)

  Convert a string to a contiguous byte array representation encoded as UTF-8 bytes. This representation is often appropriate for
  passing strings to C.
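The behaviour described in that help entry can be checked with a short sketch like the following (variable names are illustrative only):

```julia
v = Vector{UInt8}("abbbc")

s1 = String(copy(v))   # the copy is consumed; v keeps its bytes
n_before = length(v)   # 5

s2 = String(v)         # v itself is consumed and truncated
n_after = length(v)    # 0

# take! on a writable IOBuffer returns a byte vector that String can reuse
io = IOBuffer()
write(io, "abc")
s3 = String(take!(io))
```

Here s1 and s2 are both "abbbc", but only the second call leaves v empty.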

stevengj · October 5, 2020, 2:49am

Certainly the low-level PCRE library supports this, so it would be possible to implement by replicating or generalizing some of the code in Base.

(This has come up a few times; I’ve often thought that it might be useful to put together a StringView type that wraps around any a::AbstractVector{UInt8} with stride(a,1) == 1 and exposes the regex methods etcetera.)
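To illustrate the stride(a,1) == 1 condition: it holds for a plain Vector{UInt8} and for unit-range views, but not for a view that skips elements. A quick sketch, with variable names chosen only for this example:

```julia
x = Vector{UInt8}("abbbc")

contiguous = @view x[2:4]       # unit-range view: adjacent bytes in memory
strided    = @view x[1:2:end]   # every second byte: not contiguous

stride(x, 1)            # 1
stride(contiguous, 1)   # 1
stride(strided, 1)      # 2, so this one would not qualify
```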

kernelmethod · October 5, 2020, 5:09am

Sure, I’ll try out this approach. If I get anywhere with it I’ll open a PR, since I imagine it’s generally useful functionality that more people could do with.

kernelmethod · October 9, 2020, 4:44am

I’ve opened up an issue (https://github.com/JuliaLang/julia/issues/37956) in case anybody is interested in taking this on and/or tracking its progress.

mgkuhn · October 10, 2020, 8:45pm

I think this feature request is a symptom of a far more wide-ranging problem that should be solved instead: Julia’s stdlib still lacks a version of String suitable for processing binary and other non-UTF-8 string data, with all the facilities that String offers, including, but by no means limited to, regular expressions. I would therefore suggest that feature request #37979 is a more generic solution to the same problem: it would make the entire AbstractString API (including regex) easily available for processing byte sequences where Unicode is not of interest, by adding a binary/byte/basic-latin sibling of String, which could be called BString. Basically, a 1 byte = 1 character version of String without a UTF-8 decoder running behind the scenes all the time. Vector{UInt8} (mutable) seems more like a workaround for the lack of an immutable binary string type.
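As a rough illustration of what a "1 byte = 1 character" string could look like, here is a minimal sketch of a wrapper type. The name Latin1View is hypothetical (it is not BString from the feature request, nor anything in Base or a package), and the sketch assumes a 1-based byte vector and maps each byte to the code point with the same value. It only implements the basic AbstractString iteration interface and does not address regex support.

```julia
# Hypothetical type for illustration only; not part of Base or any package.
struct Latin1View{T<:AbstractVector{UInt8}} <: AbstractString
    data::T   # assumed to use 1-based indexing
end

# Every byte is one code unit and also one character.
Base.ncodeunits(s::Latin1View) = length(s.data)
Base.codeunit(s::Latin1View) = UInt8
Base.codeunit(s::Latin1View, i::Int) = s.data[i]
Base.isvalid(s::Latin1View, i::Int) = 1 <= i <= ncodeunits(s)

# Iteration maps byte b to the code point U+00bb (b in hex), with no UTF-8 decoding.
function Base.iterate(s::Latin1View, i::Int=1)
    i > ncodeunits(s) && return nothing
    return Char(s.data[i]), i + 1
end

bytes = UInt8[0x61, 0xe9]   # 'a' followed by byte 0xe9
s = Latin1View(bytes)

length(s)     # 2 characters, one per byte
collect(s)    # ['a', '\u00e9']
```

The wrapper does not copy or truncate bytes, and generic AbstractString functions such as length and == work through the methods above.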

stevengj · October 11, 2020, 2:32am

What does this mean? ASCII? What is a “character” if it is not a Unicode codepoint?

Besides, the PCRE regex library used by Julia expects UTF-8.

Tamas_Papp · October 11, 2020, 8:18am

I am not sure why this cannot just be an additional package. Cf. Kristoffer’s suggestion in the issue you opened.

That said, what’s the use case for

besides ASCII? EBCDIC or historical 8-bit codepages? Why not just convert those to UTF8?

stevengj · November 10, 2020, 9:30pm

Update: I created a draft package for this:
