Regex capture next line in text file

I’m fairly new to Julia and was wondering if anyone had some advice. I’m loving the regex feature, which is working great for a line-by-line query, for example:

open(filename) do file
    for (index, ln) in enumerate(eachline(file))
        m = match(r"\s+<minimizedAffinity>\s*", ln)
        if m !== nothing
            print(m)
        end
    end
end

However, I’m struggling with how to read in the whole file and find matches for a multiline pattern, i.e. the float value that comes on the line after the keyword.

(\s+<minimizedAffinity>[\n,\r]([-+]?[0-9]*\.?[0-9]+(e[-+]?[0-9]+)?))

The goal is to extract a list of float values from the text file.

I can faff around with indexing the lines but there must be a more elegant solution?

How large is this file? If you can feasibly load the entire thing as a string, I think you can just run that query (though you’ll want to remove the comma from the [\n,\r] character class, since it matches a literal comma there). Otherwise, you can do

function findnumber(file)
    lookfornumber = false
    for line in eachline(file)
        if lookfornumber
            m = match(<+just the number pattern+>, line)
            isnothing(m) || return m
            lookfornumber = false
        end
        m = match(<+just the preceding line pattern+>, line)
        if !isnothing(m)
            lookfornumber = true
        end
    end
    return nothing
end
m = open(findnumber, filename)
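
For the load-everything route mentioned at the top of this reply, a minimal sketch could look like the following. The helper name extract_affinities and the sample text are made up for illustration, and the pattern is an adapted version of the one in the question (comma removed from the character class, inner group made non-capturing), so treat it as a starting point rather than a tested solution for the real file.

```julia
# Hypothetical helper: collect every float found on the line after the keyword.
function extract_affinities(text::AbstractString)
    # Keyword, optional trailing blanks, one or more CR/LF characters
    # (no comma in the class), then the captured number.
    pattern = r"<minimizedAffinity>[ \t]*[\r\n]+\s*([-+]?[0-9]*\.?[0-9]+(?:e[-+]?[0-9]+)?)"
    return [parse(Float64, m[1]) for m in eachmatch(pattern, text)]
end

# Made-up sample standing in for read(filename, String).
sample = "  <minimizedAffinity>\n-7.25\n  <other>\n1.0\n  <minimizedAffinity>\r\n3.1e-2\n"
affinities = extract_affinities(sample)
```

With a real file you would pass read(filename, String) instead of the sample, assuming the file fits comfortably in memory.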

Ah yes, that’s a clever way of doing it. Thank you, this worked!!

Because this pattern depends on two lines, the Iterators.Stateful trick can manage this logic. Try:

function findnumber2(file)
    itr = Iterators.Stateful(eachline(file))
    for line in itr
        m = match(r"Number:", line)
        if !isnothing(m)
            n = match(r"\d+", something(peek(itr), ""))
            isnothing(n) || return n
        end
    end
    return nothing
end

For example, with the regexes above, the following file matches:

just
plain
324
63463
Number:
234234
fsdf
sdfg
Number:
sdfsf

and in Julia with the above function:

julia> testfile = IOBuffer("just\nplain\n324\n63463\nNumber:\n234234\nfsdf\nsdfg\nNumber:\nsdfsf")
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=59, maxsize=Inf, ptr=1, mark=-1)

julia> findnumber2(testfile)
RegexMatch("234234")

gustaphe’s answer findnumber finds this match as well.


@Dan, why do we need something(peek(itr), "")? Why is peek(itr) alone not enough? Thank you.

I suppose it is to handle the case where the last line of the file is “Number:”. Then no line follows, peek(itr) returns nothing, and match has no method for it:

function findnumber21(file)
    itr = Iterators.Stateful(eachline(file))
    for line in itr
        m = match(r"Number:", line)
        if !isnothing(m)
            n = match(r"\d+", peek(itr))
            isnothing(n) || return n
        end
    end
    return nothing
end

julia> str="just\nplain\n324\n63463\nfsdf\nsdfg\nsdfsf\nNumber:"
"just\nplain\n324\n63463\nfsdf\nsdfg\nsdfsf\nNumber:"

julia> tf=IOBuffer(str)
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=44, maxsize=Inf, ptr=1, mark=-1)

julia> findnumber21(tf)
ERROR: MethodError: no method matching match(::Regex, ::Nothing)

Closest candidates are:
  match(::Regex, ::Union{SubString{String}, String}, ::Integer)
   @ Base regex.jl:374
  match(::Regex, ::Union{SubString{String}, String}, ::Integer, ::UInt32)
   @ Base regex.jl:374
  match(::Regex, ::InlineString, ::Integer)
   @ InlineStrings C:\Users\sprmn\.julia\packages\InlineStrings\rlLZO\src\InlineStrings.jl:711
  ...

Stacktrace:
 [1] findnumber21(file::IOBuffer)
   @ Main c:\Users\sprmn\.julia\environments\v1.9.0\regex_file.jl:36
 [2] top-level scope
   @ c:\Users\sprmn\.julia\environments\v1.9.0\regex_file.jl:46

julia> findnumber2(tf)
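
The same behaviour can be checked in isolation, without a file. This is just a small sketch with a one-element iterator standing in for a file whose last line is "Number:"; the variable names are illustrative only.

```julia
# One-element iterator standing in for a file whose last line is "Number:".
itr = Iterators.Stateful(["Number:"])

line = popfirst!(itr)             # consume the only element
next_line = peek(itr)             # iterator is exhausted, so this is `nothing`
safe_line = something(next_line, "")  # fall back to an empty string
m = match(r"\d+", safe_line)      # no digits in "", so no match and no MethodError
```

So something(peek(itr), "") only matters at the very end of the input, where it turns the missing next line into an empty string that match can accept.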

This avoids the need for the something() function (although it was nice to know about it).

using IterTools
function findnumber3(tf)
    itr = partition(eachline(tf),2,1)
    m=nothing
    for tl in itr
        m = isnothing(match(r"Number:", first(tl))) ? continue : match(r"\d+", last(tl))
        !isnothing(m) && return m
    end
end

I still have doubts about the need for Iterators.Stateful (apart from its usefulness in testing).

PS
This formulation also holds for the more general problem where the lines to be inspected are k positions apart.
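
As a rough sketch of that generalisation, assuming IterTools.partition(xs, n, step) yields overlapping n-tuples as in findnumber3, a window of k + 1 lines puts the keyword line first and the candidate line last. The name findnumber_k and the sample input are made up for illustration.

```julia
using IterTools  # third-party package, same as in findnumber3

# Hypothetical generalisation: the number sits k lines after the keyword line.
function findnumber_k(io, k)
    # Sliding windows of k + 1 consecutive lines, advancing one line at a time.
    for window in partition(eachline(io), k + 1, 1)
        if occursin(r"Number:", first(window))
            m = match(r"\d+", last(window))
            isnothing(m) || return m
        end
    end
    return nothing
end

# Made-up input: the number is two lines below "Number:".
result = findnumber_k(IOBuffer("a\nNumber:\nskip\n42\nend"), 2)
```

With k = 1 this reduces to the two-line case handled by findnumber3.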

Even easier: mmap the file, wrap a StringView around it, and then run the desired multi-line regex query:

using StringViews, Mmap
open(filename, "r") do io
    s = StringView(mmap(io))
    for m in eachmatch(regex, s)
       # do something
    end
end

This way, the operating system will page the file into memory as needed, even if it is huge, but it will still act like a string you loaded all at once.

As a bonus, this will also avoid the eachline performance cost of allocating a new string for every line. (Though this can be done more efficiently with the ViewReader.jl package and with the upcoming copyuntil in Julia 1.11.)
