I’m fairly new to Julia and was wondering if anyone had some advice. I’m loving the regex feature, which works great for a line-by-line query, for example:
open(filename) do file
    for (index, ln) in enumerate(eachline(file))
        m = match(r"\s+<minimizedAffinity>\s*", ln)
        if m !== nothing
            print(m)
        end
    end
end
However, I am struggling with how to read in the whole file and find matches for a multiline pattern, i.e. the float value that comes after the keyword.
How large is this file? If you can feasibly load the entire thing as a string, I think you can just run that query (though you’ll want to remove the comma). Otherwise, you can do:
function findnumber(file)
    lookfornumber = false
    for line in eachline(file)
        if lookfornumber
            m = match(<+just the number pattern+>, line)
            isnothing(m) || return m
            lookfornumber = false
        end
        m = match(<+just the preceding line pattern+>, line)
        if !isnothing(m)
            lookfornumber = true
        end
    end
    return nothing
end
m = open(findnumber, filename)
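For the first option (loading the whole file as one string), here is a minimal sketch. The sample text and the exact pattern are assumptions for illustration, since the real file layout isn't shown in this thread; in practice `text` would come from `read(filename, String)`. The point is that `\s` also matches newlines, so one pattern can span the keyword line and the line with the number.

```julia
# Hypothetical sample text standing in for read(filename, String).
text = """
<result>
  <minimizedAffinity>
  -7.25
</result>
"""

# Assumed pattern: the keyword, any whitespace (including the newline),
# then a signed decimal number captured in group 1.
pattern = r"<minimizedAffinity>\s*(-?\d+\.\d+)"

m = match(pattern, text)
value = isnothing(m) ? nothing : parse(Float64, m[1])
println(value)
```

If the number can also appear in other forms (integers, exponents), the capture group would need adjusting.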
Because this pattern depends on two lines, the Iterators.Stateful trick can manage this logic. Try:
function findnumber2(file)
    itr = Iterators.Stateful(eachline(file))
    for line in itr
        m = match(r"Number:", line)
        if !isnothing(m)
            n = match(r"\d+", something(peek(itr), ""))
            isnothing(n) || return n
        end
    end
    return nothing
end
For example, with the regexes above, the following file matches:
just
plain
324
63463
Number:
234234
fsdf
sdfg
Number:
sdfsf
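To try this without creating a file, the sample above can be fed in through an `IOBuffer`. This is just a sketch for checking the behavior: `sample` is a name made up here, and `findnumber2` is repeated so the snippet runs on its own.

```julia
# Same function as above, repeated so this snippet is self-contained.
function findnumber2(file)
    itr = Iterators.Stateful(eachline(file))
    for line in itr
        m = match(r"Number:", line)
        if !isnothing(m)
            # peek returns nothing at end of input, so fall back to "".
            n = match(r"\d+", something(peek(itr), ""))
            isnothing(n) || return n
        end
    end
    return nothing
end

# The sample file from above, as an in-memory string.
sample = join(["just", "plain", "324", "63463", "Number:", "234234",
               "fsdf", "sdfg", "Number:", "sdfsf"], "\n")

m = findnumber2(IOBuffer(sample))
println(m.match)  # prints 234234
```

The bare numbers `324` and `63463` are skipped because they do not directly follow a `Number:` line.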
The following version avoids the need for the something() function (although it was nice to learn about it):
using IterTools
function findnumber3(tf)
    itr = partition(eachline(tf), 2, 1)
    m = nothing
    for tl in itr
        m = isnothing(match(r"Number:", first(tl))) ? continue : match(r"\d+", last(tl))
        !isnothing(m) && return m
    end
end
I still have doubts about the need for Iterators.Stateful (apart from its usefulness in testing).
P.S. This formulation also holds for the more general problem where the lines to be inspected are k positions apart.
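A rough sketch of that generalization, assuming the number sits exactly `k` lines after the keyword line. The function name `findnumber_k` and the sample input are made up for illustration; the only change from the two-line version is the window size passed to `partition`.

```julia
using IterTools  # third-party package, as in the post above

# Hypothetical generalization: the number is expected k lines after "Number:".
function findnumber_k(io, k::Integer)
    # Sliding windows of k + 1 consecutive lines, advancing one line at a time.
    for window in partition(eachline(io), k + 1, 1)
        occursin(r"Number:", first(window)) || continue
        m = match(r"\d+", last(window))
        isnothing(m) || return m
    end
    return nothing
end

# Made-up input: the number is two lines below the keyword.
m = findnumber_k(IOBuffer("a\nNumber:\nskip me\n42\nb\n"), 2)
println(m.match)  # prints 42
```

With `k = 1` this reduces to the two-line case above. Note that a keyword within the last `k` lines of the input is never inspected, because no full window starts there.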
Even easier: mmap the file, wrap a StringView around it, and then run the desired multi-line regex query:
using StringViews, Mmap
open(filename, "r") do io
    s = StringView(mmap(io))
    for m in eachmatch(regex, s)
        # do something
    end
end
This way, the operating system will page the file into memory as needed, even if it is huge, but it will still act like a string you loaded all at once.
As a bonus, this will also avoid the eachline performance cost of allocating a new string for every line. (Though this can be done more efficiently with the ViewReader.jl package and with the upcoming copyuntil in Julia 1.11.)