Replace lines of file A using advanced regex and write to file B

Hi,

I've tried for quite some time now to replace a regex pattern in a large file and write the result to a second file. The replace expression works as expected on a dummy string. With the files, I always end up with file_B either having the contents of file_A without newlines or being a copy of file_A (depending on whether I set the keep parameter of the eachline function to false or true). What am I missing? Thanks in advance!

regex = r"(?<=|lookbehind|)<PATTERN>(?=|lookahead|)"

open("file_B","w") do file
    for line in eachline("file_A")
        ln = replace(String(line), regex=>"<REPLACEMENT>")
        write(file, ln)
    end
end

Well, if you get file A back with or without newlines, that means your replace call isn't doing anything, right? Are you sure it works as expected? It would help if you could give the real regex and real lines that it should work with.

Thanks. Yes, if I assign the contents of a much smaller dummy file (file_A) to a multiline string and run replace on it, the regex works, but not if I use the file directly.

regex = r"(?<=[ATCGN]{10})\n(?=[ATCGN]{10})"

Link to dummy file (850kb):
https://send.vis.ee/download/392df9b35ac68073/#2Kpcv3Yh5Q_0Kbpi-dNWWg

A regex trying to match something following \n will fail when applied to the lines read from a file because you will not get any lines that have anything beyond \n. Anything beyond \n will be in the next line.
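To make this concrete, here is a small sketch that uses an in-memory IOBuffer in place of file_A. The two-line sequence is made up for illustration and is not taken from the OP's file. It checks the regex against the lines produced by eachline (with and without keep=true) and against the whole string.

```julia
# Made-up data standing in for file_A (an assumption, not the OP's file).
data = "ATCGATCGATCG\nGGCCTTAAGGCC\n"
regex = r"(?<=[ATCGN]{10})\n(?=[ATCGN]{10})"

# Default keep=false: the trailing "\n" is stripped, so there is nothing to match.
lines = collect(eachline(IOBuffer(data)))
stripped_hit = any(occursin(regex, l) for l in lines)

# keep=true: the "\n" is there, but it is the last character of each line,
# so the lookahead (?=[ATCGN]{10}) has nothing to look at and fails.
kept = collect(eachline(IOBuffer(data); keep=true))
kept_hit = any(occursin(regex, l) for l in kept)

# On the whole string the regex sees both sides of the first newline.
joined = replace(data, regex => "")

@show stripped_hit kept_hit joined
```

Only the last call changes anything: the first newline is removed, while the final one stays because nothing follows it.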

That’s true. However, I also ran the following regex with the same result.

regex = r"(?<=[ATGCN]{10})\n"

Would this work:

regex = r"(?<=[ATCGN]{10})\n(?=[ATCGN]{10})"
str = join(readlines("SSU_dummy.txt"),'\n')
write("SSU_dummy_OUT1.txt", replace(str, regex=>"REPLACEMENT"))

https://docs.julialang.org/en/v1/base/io-network/#Base.readline

When keep is false (as it is by default), these trailing newline characters are removed from the line before it is returned

Use $ instead to match the end of the line or pass keep=true.
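As a rough sketch of the keep=true route: for the lookbehind-only regex, processing line by line can work, because the newline is then part of each line. The input below is made up, and IOBuffers stand in for the two files.

```julia
regex = r"(?<=[ATGCN]{10})\n"

# Made-up input; the middle line is too short for the lookbehind to succeed.
input = IOBuffer("ATCGATCGATCG\nGGCC\nTTAAGGCCTTAA\n")
output = IOBuffer()

for line in eachline(input; keep=true)   # keep=true retains the trailing "\n"
    write(output, replace(line, regex => ""))
end

result = String(take!(output))
@show result
```

This assumes LF line endings. With CRLF files, the character just before \n is \r, so the lookbehind would not match.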

The fastest and simplest thing is obviously to read the whole file into a string and then write it back out:

s = read("file_A", String)
write("file_B", replace(s, regex=>replacement))

If the file is too big to read into memory, then speed is also at a premium and processing it line-by-line will be slow. In such cases of truly huge files, I would mmap it to a StringView and loop through it one replacement at a time (letting the operating system take care of paging things in and out of memory). e.g. something like

using Mmap, StringViews
open("file_B", "w") do out
    s = StringView(mmap("file_A"))
    pos = 1
    for m in eachmatch(regex, s)
        @views write(out, s[pos:prevind(s, m.offset)])
        pos = m.offset + ncodeunits(m.match) # next char after match
        write(out, replacement)
    end
    @views write(out, s[pos:end]) # write remaining data
end
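For anyone who wants to try the loop without installing StringViews, here is a sketch of the same idea on an ordinary in-memory String, writing to an IOBuffer. The helper name stream_replace and the sample data are hypothetical, chosen only for illustration. It does not give the memory savings of mmap, it just shows how the offsets are tracked.

```julia
# Hypothetical helper: writes `s` to `out` with every match of `regex`
# replaced by `replacement`, one match at a time.
function stream_replace(out::IO, s::AbstractString, regex::Regex, replacement::AbstractString)
    pos = 1
    for m in eachmatch(regex, s)
        # text between the previous match and this one
        write(out, SubString(s, pos, prevind(s, m.offset)))
        pos = m.offset + ncodeunits(m.match)  # first index after the match
        write(out, replacement)
    end
    write(out, SubString(s, pos))             # remaining data after the last match
    return out
end

regex = r"(?<=[ATCGN]{10})\n(?=[ATCGN]{10})"
s = "ATCGATCGATCG\nGGCCTTAAGGCC\nNNNNATCGATCG\n"   # made-up data

streamed = String(take!(stream_replace(IOBuffer(), s, regex, "")))
@show streamed == replace(s, regex => "")
```

The sketch assumes the regex never produces empty matches, which holds for the pattern in this thread.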

In this case, OP’s regex would have to be modified (at least in Windows) to replace \n by \r\n:

regex = r"(?<=[ATCGN]{10})\r\n(?=[ATCGN]{10})"

@stevengj

Great, thanks! Both solutions worked!

The second used up to 500x less memory and was 4-5 times faster.

\r was part of the problem with non-matching RegEx.

Details: After removing all \r beforehand, matching worked as expected (except for eachline() with the multiline pattern). Strangely, for me \r didn't show up in text editors (vim & vscode) when opening the files, and appeared only after reading them with Julia functions (but maybe that was what @Palli referred to?).
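Editors often detect CRLF line endings and hide the \r, so it can be easier to check from Julia. A minimal sketch, with made-up bytes standing in for the file contents (for a real file, `raw = read("file_A")` returns the same kind of byte vector):

```julia
# Made-up contents with mixed line endings (an assumption for illustration).
raw = Vector{UInt8}("ATCG\r\nGGCC\r\nTTAA\n")

n_cr = count(==(UInt8('\r')), raw)   # number of carriage returns
n_lf = count(==(UInt8('\n')), raw)   # number of line feeds
@show n_cr n_lf

# Normalize CRLF to LF before applying an LF-only regex.
cleaned = replace(String(copy(raw)), "\r\n" => "\n")
@show cleaned
```

If n_cr is nonzero, a pattern with a bare \n next to a lookbehind on sequence characters will not match at those line ends.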

On my PC using your input file, Steve’s 2nd solution was 40% faster than the first, and it did replace the pattern.

As you (now) know, eachline is incompatible with a multi-line regex, as it would be in other languages. (For others reading the original question: the solution may seem complex, and StringView is rather advanced and not usually needed by new users.)

Do check if your regex engine supports \R as a shorthand character class and you will not need to be concerned with the various Unicode newline / linefeed combos.

It seems PCRE, and thus Julia, does support it.

No, then it only works on Windows, so FYI, the above is better.

The fastest and simplest thing would be to use readline, when multi-line isn’t needed…

Better to use \r?\n so it matches both CRLF (Windows) and LF (everything else), or to use \R (which additionally matches various other Unicode line breaks) as @Palli suggested.
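A quick sketch comparing the three variants on made-up LF and CRLF data (the sample strings are assumptions, not the OP's file):

```julia
lf   = "ATCGATCGATCG\nGGCCTTAAGGCC\n"
crlf = "ATCGATCGATCG\r\nGGCCTTAAGGCC\r\n"

lf_only  = r"(?<=[ATCGN]{10})\n(?=[ATCGN]{10})"     # original pattern
portable = r"(?<=[ATCGN]{10})\r?\n(?=[ATCGN]{10})"  # LF or CRLF
generic  = r"(?<=[ATCGN]{10})\R(?=[ATCGN]{10})"     # any newline sequence PCRE knows

# The LF-only pattern misses CRLF input: "\r" sits between the bases and "\n".
lf_only_on_crlf = occursin(lf_only, crlf)

# The other two handle both kinds of input.
portable_lf   = replace(lf, portable => "")
portable_crlf = replace(crlf, portable => "")
generic_lf    = replace(lf, generic => "")
generic_crlf  = replace(crlf, generic => "")

@show lf_only_on_crlf portable_lf portable_crlf generic_lf generic_crlf
```

In each case only the interior line break is removed. The final one stays because the lookahead finds nothing after it.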

I agree that this is annoyingly complicated. It’s even worse if you want to do more advanced s"foo" pattern substitutions, or multiple replacements. So I proposed a PR https://github.com/JuliaLang/julia/pull/48625 that should simplify it to:

using Mmap, StringViews
s = StringView(mmap("input_file"))
open(out -> replace(out, s, pat=>repl), "output_file", "w")

Isn’t the proposed optional out argument missing in replace(), so that it uses s defined above?

Whoops, it was missing the s argument.

The initial IO argument for output was implemented in Julia 1.10 via pull request 48625.
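A minimal sketch of that form, assuming Julia 1.10 or later and using an IOBuffer plus a made-up string in place of the files:

```julia
# Requires Julia 1.10+ (replace with a leading IO argument).
regex = r"(?<=[ATCGN]{10})\r?\n(?=[ATCGN]{10})"
s = "ATCGATCGATCG\nGGCCTTAAGGCC\n"   # made-up data

out = IOBuffer()
replace(out, s, regex => "")          # writes the result to `out` instead of returning a String
result = String(take!(out))
@show result
```

With a real file, `out` would be the stream opened for writing, as in the open(...) do-block pattern shown earlier in the thread.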
