How to (efficiently) filter a file/stream on-the-fly?

I have a fixed-width output file from a Fortran program, in a well-defined matrix format except for occasional lines that I can regard as comments. I would like to pass the file to something like DelimitedFiles.readdlm() without it trying to parse the comment rows.

Currently, my solution is to read the whole file, create an in-memory stream buffer, and write the non-comment lines to that stream, which readdlm() can then parse.

using DelimitedFiles

datarow_predicate = !startswith("3333") # comment lines start with 3333

function filterstream(filename::AbstractString; predicate = datarow_predicate)
    filteredstream = IOBuffer()
    infilestream = open(filename) # |> GzipDecompressorStream
    for line in eachline(infilestream)
        if predicate(line)
            println(filteredstream, line) # keep data rows, re-adding the newline
        end
    end
    close(infilestream)

    seekstart(filteredstream) # rewind so readdlm reads from the beginning
    return filteredstream
end

example_matrix = readdlm(filterstream("out.dat"))

But this seems like a bad idea, because it reads everything into memory. In Python, a better approach would be a generator function that yields each line that passes the test. Is there a way to do something similar in Julia? That is, some function/technique that reads and filters lines from a file/stream on demand, as a function like readdlm() asks for them?
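
The laziness itself is easy to express with an iterator, e.g. (just a sketch of what I mean, with the same 3333 predicate):

datarows = Iterators.filter(!startswith("3333"), eachline("out.dat"))

The problem is that readdlm() wants an IO (or a filename), not an iterator of lines.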

For fixed-width tables, you could try FixedWidthTables.jl; specifically, its skiprows_startwith argument.

But a generic and efficient mechanism to filter/modify lines in an IO stream would be nice indeed!..

If you are already looping over the file, why not directly parse the lines you need?

(PS: I wish CSV.jl had filtering options for this kind of requirement.)
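
For example, something along these lines (an untested sketch; parsefile is just a placeholder name, and it assumes whitespace-separated Float64 columns with the 3333 comment prefix from your question):

function parsefile(filename; comment = "3333")
    rows = Vector{Vector{Float64}}()
    for line in eachline(filename)
        startswith(line, comment) && continue     # skip comment lines
        push!(rows, parse.(Float64, split(line))) # parse one data row
    end
    return reduce(vcat, permutedims.(rows))       # stack row vectors into a matrix
end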

Have you tried Iterators.filter together with CSV.Rows as in this solution?

Or TableOperations.filter as suggested further down in the same thread.
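
Untested, but roughly in this direction (assuming whitespace-delimited columns with repeated spaces; the keyword arguments would need adjusting for the actual file):

using CSV

datarows = Iterators.filter(row -> !startswith(string(row[1]), "3333"),
                            CSV.Rows("out.dat"; header = false, delim = ' ', ignorerepeated = true))

TableOperations.filter works similarly, but keeps the result a Tables.jl-compatible table, so it can be materialized directly, e.g. into a DataFrame.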

Tried something like that now, thanks. For the moment, though, the best I could get was with eachsplit. The issue with CSV.Rows is that any problems (type instabilities, for example) then occur inside the CSV machinery, which is much harder to debug. But something will come out of this: I'm getting close to the performance of reading the CSV directly into a DataFrame, when I adjust the file to allow that.
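
Roughly what the eachsplit version looks like (a sketch; readmatrix is my placeholder name, it needs Julia ≥ 1.8 for eachsplit, and it assumes all data rows have the same number of columns):

function readmatrix(filename; comment = "3333")
    vals = Float64[]
    ncols = 0
    for line in eachline(filename)
        startswith(line, comment) && continue # skip comment lines
        ncols = 0
        for tok in eachsplit(line)            # lazily split on whitespace
            push!(vals, parse(Float64, tok))
            ncols += 1
        end
    end
    return permutedims(reshape(vals, ncols, :)) # rows were appended column-major
end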

Abhro:

When you have this finally figured out, could you post an MWE that can process a simple text file containing integers and comments, like this one:

4
6
# This is a comment
  78
     # This is also a comment
 8
     12  

This you can do with DelimitedFiles.readdlm via its comments and comment_char keywords (note that comment_char is a single Char, so this handles the # comments here, but not multi-character prefixes like the 3333 from the original question):

julia> data = """
       4
       6
       # This is a comment
         78
            # This is also a comment
        8
            12
       """
"4\n6\n# This is a comment\n  78\n     # This is also a comment\n 8\n     12\n"

julia> readdlm(IOBuffer(data), Int; comments=true, comment_char='#')
5×1 Matrix{Int64}:
  4
  6
 78
  8
 12