Replace lines of file A using advanced regex and write to file B

Johannes_Schweichhar · February 9, 2023, 4:59pm

Hi,

I've tried for quite some time now to replace a regex pattern in a large file and to write the result to a second file. The replace expression works as expected on a dummy string. With the files, I always end up with file_B either having the contents of file_A without newlines or being a copy of file_A (depending on whether I set the keep parameter of the eachline function to false or to true). What am I missing? Thanks in advance!

regex = r"(?<=|lookbehind|)<PATTERN>(?=|lookahead|)"

open("file_B","w") do file
    for line in eachline("file_A")
        ln = replace(String(line), regex=>"<REPLACEMENT>")
        write(file, ln)
    end
end

jules · February 9, 2023, 7:20pm

Well if you get file A with or without newlines back that means that your replace function isn’t doing anything, right? Are you sure it works as expected? It would help if you could give the real regex and real lines that it should work with.

Johannes_Schweichhar · February 9, 2023, 8:33pm

Thanks. Yes, if I assign the contents of a much smaller dummy file (file_A) to a multiline string and run replace on it, the regex works, but not if I use the file directly.

regex = r"(?<=[ATCGN]{10})\n(?=[ATCGN]{10})"

Link to dummy file (850kb):
https://send.vis.ee/download/392df9b35ac68073/#2Kpcv3Yh5Q_0Kbpi-dNWWg

Jeff_Emanuel · February 9, 2023, 9:17pm

A regex trying to match something following \n will fail when applied to the lines read from a file because you will not get any lines that have anything beyond \n. Anything beyond \n will be in the next line.
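
A minimal sketch of this point, using an in-memory IOBuffer and a made-up two-line input in place of file_A (both are assumptions for illustration). Even with keep=true, no single line holds any text after its own \n, so the lookahead can never succeed line by line:

```julia
# Hypothetical two-line input standing in for file_A.
data = "ATCGATCGAT\nATCGATCGAT\n"
regex = r"(?<=[ATCGN]{10})\n(?=[ATCGN]{10})"

# eachline returns one line at a time; with keep=true each line still ends in '\n'.
lines = collect(eachline(IOBuffer(data); keep=true))

# Nothing follows the '\n' inside a single line, so the lookahead fails every time.
per_line = [occursin(regex, line) for line in lines]

# On the whole string, the first '\n' sits between two 10-base runs and matches.
whole = occursin(regex, data)
```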

Johannes_Schweichhar · February 9, 2023, 9:43pm

That’s true. However, I also ran the following regex with the same result.

regex = r"(?<=[ATGCN]{10})\n"

rafael.guerra · February 9, 2023, 9:51pm

Would this work:

regex = r"(?<=[ATCGN]{10})\n(?=[ATCGN]{10})"
str = join(readlines("SSU_dummy.txt"),'\n')
write("SSU_dummy_OUT1.txt", replace(str, regex=>"REPLACEMENT"))

Jeff_Emanuel · February 9, 2023, 9:56pm

https://docs.julialang.org/en/v1/base/io-network/#Base.readline

When keep is false (as it is by default), these trailing newline characters are removed from the line before it is returned

Use $ instead to match the end of the line or pass keep=true.
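
A rough sketch of both options for the single-sided regex, assuming a small made-up input with LF line endings and an IOBuffer in place of the real files:

```julia
data = "ATCGATCGAT\nATCGATCGAT\n"   # hypothetical input with LF line endings

# Option 1: default keep=false, so the newline is already stripped;
# anchor on the end of the line with $ instead of matching \n.
out1 = IOBuffer()
for line in eachline(IOBuffer(data))
    write(out1, replace(line, r"(?<=[ATGCN]{10})$" => "<REPLACEMENT>"))
end
result1 = String(take!(out1))

# Option 2: keep=true, so each line still carries its trailing \n
# and the original pattern can match it.
out2 = IOBuffer()
for line in eachline(IOBuffer(data); keep=true)
    write(out2, replace(line, r"(?<=[ATGCN]{10})\n" => "<REPLACEMENT>"))
end
result2 = String(take!(out2))
```

Neither option helps with a pattern that also needs a lookahead into the following line, since that text is never part of the current line.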

stevengj · February 9, 2023, 10:10pm

The fastest and simplest thing is obviously to read the whole file into a string and then write it back out:

s = read("file_A", String)
write("file_B", replace(s, regex=>replacement))

If the file is too big to read into memory, then speed is also at a premium and processing it line-by-line will be slow. In such cases of truly huge files, I would mmap it to a StringView and loop through it one replacement at a time (letting the operating system take care of paging things in and out of memory). e.g. something like

using Mmap, StringViews
open("file_B", "w") do out
    s = StringView(mmap("file_A"))
    pos = 1
    for m in eachmatch(regex, s)
        @views write(out, s[pos:prevind(s, m.offset)])
        pos = m.offset + ncodeunits(m.match) # next char after match
        write(out, replacement)
    end
    @views write(out, s[pos:end]) # write remaining data
end
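
To check the logic of this loop without StringViews or a real file, here is a sketch of the same idea on an ordinary String, writing to an IOBuffer. The function name stream_replace and the sample input are made up for illustration:

```julia
# Same loop as above, wrapped in a (hypothetically named) function so it can be
# tried on any string and any IO without Mmap or StringViews.
function stream_replace(out::IO, s::AbstractString, regex::Regex, replacement::AbstractString)
    pos = 1
    for m in eachmatch(regex, s)
        # copy the untouched text between the previous match and this one
        write(out, SubString(s, pos, prevind(s, m.offset)))
        pos = m.offset + ncodeunits(m.match)   # first index after the match
        write(out, replacement)
    end
    write(out, SubString(s, pos))              # remaining data after the last match
    return out
end

s = "ATCGATCGAT\nATCGATCGAT\nshort\n"          # hypothetical sample input
regex = r"(?<=[ATCGN]{10})\n(?=[ATCGN]{10})"
streamed = String(take!(stream_replace(IOBuffer(), s, regex, "")))
```

On this input the result should agree with replace(s, regex => ""), since only the first newline sits between two 10-base runs.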

rafael.guerra · February 9, 2023, 10:31pm

In this case, OP’s regex would have to be modified (at least in Windows) to replace \n by \r\n:

regex = r"(?<=[ATCGN]{10})\r\n(?=[ATCGN]{10})"

Johannes_Schweichhar · February 9, 2023, 10:48pm

@stevengj

Great, thanks! Both solutions worked!

The second used up to 500x less memory and was 4-5x faster.

\r was part of the problem with non-matching RegEx.

Details: After removing all \r beforehand, matching worked as expected (except for eachline() with the multiline pattern). Strangely for me \r didn’t show up in text editors (vim & vscode) when opening the files, and appeared only after reading them with Julia functions (but maybe that was what @Palli referred to?).

rafael.guerra · February 9, 2023, 11:02pm

On my PC using your input file, Steve’s 2nd solution was 40% faster than the first, and it did replace the pattern.

Palli · February 10, 2023, 12:09am

As you (now) know, eachline is incompatible with multi-line regex, as it would be in other languages. (For others reading the original question: the solution may seem complex, and StringView is rather advanced and not usually needed [for new users].)

Do check if your regex engine supports \R as a shorthand character class and you will not need to be concerned with the various Unicode newline / linefeed combos.

It seems PCRE, and thus Julia, does support it.

No, then it only works on Windows, so FYI, the above is better.

The fastest and simplest thing would be to use readline, when multi-line isn’t needed…

stevengj · February 10, 2023, 12:56am

Better to use \r?\n so it matches both CRLF (Windows) and LF (everything else), or to use \R (which additionally matches various other Unicode line breaks) as @Palli suggested.
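
A small sketch comparing the two patterns on made-up LF and CRLF samples (the sample strings and variable names are assumptions for illustration):

```julia
lf   = "ATCGATCGAT\nATCGATCGAT"      # hypothetical sample with LF
crlf = "ATCGATCGAT\r\nATCGATCGAT"    # hypothetical sample with CRLF

# \r?\n accepts an optional carriage return before the line feed.
re_crlf = r"(?<=[ATCGN]{10})\r?\n(?=[ATCGN]{10})"

# \R is PCRE's generic line-break escape; it treats \r\n as a single break.
re_R = r"(?<=[ATCGN]{10})\R(?=[ATCGN]{10})"

joined = "ATCGATCGATATCGATCGAT"
results = [replace(str, re => "") for str in (lf, crlf), re in (re_crlf, re_R)]
```

Here every entry of results should equal joined, whichever line ending the sample uses.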

stevengj · February 10, 2023, 4:30am

I agree that this is annoyingly complicated. It’s even worse if you want to do more advanced s"foo" pattern substitutions, or multiple replacements. So I proposed a PR https://github.com/JuliaLang/julia/pull/48625 that should simplify it to:

using Mmap, StringViews
s = StringView(mmap("input_file"))
open(out -> replace(out, s, pat=>repl), "output_file", "w")

rafael.guerra · February 10, 2023, 4:49am

Isn’t the s argument missing from the proposed replace() call with the optional out argument, so that it uses the s defined above?

stevengj · February 10, 2023, 12:39pm

Whoops, it was missing the s argument.

mkitti · January 10, 2024, 10:23pm

The initial IO argument for output was implemented in Julia 1.10 via pull request 48625
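
A minimal sketch of that form, assuming Julia 1.10 or later, with an ordinary String and an IOBuffer standing in for the mmapped input and the output file (the sample content is made up):

```julia
s = "ATCGATCGAT\r\nATCGATCGAT\r\n"    # hypothetical content standing in for the input file
pat = r"(?<=[ATCGN]{10})\r?\n(?=[ATCGN]{10})"
repl = ""

out = IOBuffer()                        # an open file handle could be used the same way
if VERSION >= v"1.10"
    replace(out, s, pat => repl)        # writes the result straight to `out`
else
    write(out, replace(s, pat => repl)) # fallback for older Julia: build the string first
end
result = String(take!(out))
```

Only the first line break is removed here, because the final one is not followed by another 10-base run.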
