I've been trying for quite some time now to replace a regex pattern in a large file and write the result to a second file. The replace expression works as expected on a dummy string. With the files, I always end up with file_B either having the contents of file_A without newlines or being a copy of file_A (depending on whether I set the keep parameter of the eachline function to false or true). What am I missing? Thanks in advance!
regex = r"(?<=|lookbehind|)<PATTERN>(?=|lookahead|)"
open("file_B","w") do file
    for line in eachline("file_A")
        ln = replace(String(line), regex=>"<REPLACEMENT>")
        write(file, ln)
    end
end
Well if you get file A with or without newlines back that means that your replace function isn’t doing anything, right? Are you sure it works as expected? It would help if you could give the real regex and real lines that it should work with.
Thanks. Yes, if I assign the contents of a much smaller dummy file (file_A) to a multiline string and run replace on it the regex works, but not if I use the file directly.
A regex trying to match something following \n will fail when applied to the lines read from a file because you will not get any lines that have anything beyond \n. Anything beyond \n will be in the next line.
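To make the point above concrete, here is a minimal sketch comparing the two approaches on an in-memory string. The sample text and the pattern `(?<=header\n)value` are made up for illustration; they are not the original poster's data or regex.

```julia
# Hypothetical sample text and pattern, for illustration only.
text = "header\nvalue=1\nheader\nvalue=2\n"
pattern = r"(?<=header\n)value"   # "value" only when preceded by "header\n"

# Whole string: the lookbehind can see the previous line and its newline.
whole = replace(text, pattern => "VALUE")

# Line by line: each chunk ends at "\n", so within a single line nothing
# precedes "value" and the lookbehind can never succeed.
buf = IOBuffer()
for line in eachline(IOBuffer(text); keep=true)
    write(buf, replace(line, pattern => "VALUE"))
end
linewise = String(take!(buf))

println(repr(whole))
println(repr(linewise))
```

In this sketch the whole-string version performs the replacement, while the line-by-line version just reproduces the input, which matches the symptom described in the question.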
The fastest and simplest thing is obviously to read the whole file into a string and then write it back out:
s = read("file_A", String)
write("file_B", replace(s, regex=>replacement))
If the file is too big to read into memory, then speed is also at a premium and processing it line-by-line will be slow. In such cases of truly huge files, I would mmap it to a StringView and loop through it one replacement at a time (letting the operating system take care of paging things in and out of memory). e.g. something like
using Mmap, StringViews
open("file_B", "w") do out
    s = StringView(mmap("file_A"))
    pos = 1
    for m in eachmatch(regex, s)
        @views write(out, s[pos:prevind(s, m.offset)])
        pos = m.offset + ncodeunits(m.match) # next char after match
        write(out, replacement)
    end
    @views write(out, s[pos:end]) # write remaining data
end
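The copy-between-matches loop can be tried out without any files or extra packages. The following is a sketch of the same loop on an ordinary `String`, writing to an `IOBuffer`; the sample string, pattern, and replacement are hypothetical, and the result is compared against the built-in `replace` as a sanity check.

```julia
# Hypothetical inputs, chosen only to exercise the loop.
s = "naïve foo bar foo"
regex = r"foo"
replacement = "X"

out = IOBuffer()
pos = 1
for m in eachmatch(regex, s)
    # copy the unmatched text between the previous match and this one
    @views write(out, s[pos:prevind(s, m.offset)])
    # m.offset is a byte index, so advance by code units, not characters
    pos = m.offset + ncodeunits(m.match)
    write(out, replacement)
end
@views write(out, s[pos:end])   # copy whatever follows the last match
result = String(take!(out))

println(result == replace(s, regex => replacement))
```

Using `prevind` and `ncodeunits` rather than plain `- 1` and `length` keeps the indices valid when the text contains multi-byte characters such as the `ï` here.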
The second approach used up to 500x less memory and was 4-5x faster.
\r was part of the problem with non-matching RegEx.
Details: After removing all \r beforehand, matching worked as expected (except for eachline() with the multiline pattern). Strangely for me \r didn’t show up in text editors (vim & vscode) when opening the files, and appeared only after reading them with Julia functions (but maybe that was what @Palli referred to?).
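Editors often hide carriage returns, so one way to check for them from Julia is to look at the escaped form of a line. This is a small sketch on made-up CRLF content; reading from a real file with `eachline("file_A"; keep=true)` should behave the same way.

```julia
raw = "first\r\nsecond\r\n"     # hypothetical content with Windows line endings

# keep=true preserves the line terminator, so the \r stays in the line
line = first(eachline(IOBuffer(raw); keep=true))
println(repr(line))             # repr escapes control characters, making \r visible

has_cr = occursin('\r', raw)    # quick test for any carriage return

# normalize CRLF to LF before applying a pattern that only expects \n
cleaned = replace(raw, "\r\n" => "\n")
```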
As you (now) know, eachline is incompatible with multi-line regexes, as it would be in other languages. (For others reading the original question: the solution may seem complex; StringView is rather advanced and not usually needed by new users.)
Do check if your regex engine supports \R as a shorthand character class and you will not need to be concerned with the various Unicode newline / linefeed combos.
It seems PCRE and thus Julia does support it.
No, then it only works on Windows, so FYI, the above is better.
The fastest and simplest thing would be to use readline, when multi-line isn’t needed…
Better to use \r?\n so it matches both CRLF (Windows) and LF (everything else), or to use \R (which additionally matches various other Unicode line breaks) as @Palli suggested.
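As a quick illustration of the difference, the sketch below applies the three newline patterns to a made-up string that mixes CRLF and LF line endings.

```julia
mixed = "a\r\nb\nc"   # hypothetical text with one CRLF and one LF line break

only_crlf = replace(mixed, r"\r\n" => "|")    # misses the bare \n
crlf_or_lf = replace(mixed, r"\r?\n" => "|")  # handles both endings
any_break = replace(mixed, r"\R" => "|")      # PCRE's generic line-break escape

println(repr(only_crlf))
println(repr(crlf_or_lf))
println(repr(any_break))
```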
I agree that this is annoyingly complicated. It’s even worse if you want to do more advanced s"foo" pattern substitutions, or multiple replacements. So I proposed a PR https://github.com/JuliaLang/julia/pull/48625 that should simplify it to:
using Mmap, StringViews
s = StringView(mmap("input_file"))
open(out -> replace(out, s, pat=>repl), "output_file", "w")