How to (efficiently) filter a file/stream on-the-fly?

I have a fixed-width output file from a Fortran program, in a well-defined matrix format except for occasional lines that I can regard as comments. I would like to pass the file to something like DelimitedFiles.readdlm() without it trying to parse the comment rows.

Currently, my solution is to read the whole file, create an in-memory stream buffer, and write the non-comment lines to that stream, which readdlm() can then parse.

using DelimitedFiles

datarow_predicate = !startswith("3333") # comment lines start with 3333

function filterstream(filename::AbstractString; predicate = datarow_predicate)
    filteredstream = IOBuffer()
    infilestream = open(filename) # |> GzipDecompressorStream
    for line in eachline(infilestream)
        if predicate(line)
            println(filteredstream, line) # keep data rows, re-adding the newline
        end
    end
    close(infilestream)

    seekstart(filteredstream) # rewind so readdlm reads from the beginning
    return filteredstream
end

example_matrix = readdlm(filterstream("out.dat"))

But this seems like a bad idea, because it reads everything into memory. In Python, a better approach would be a generator function that yields each line that passes the test. Is there a way to do something similar in Julia? That is, some function/technique that reads and filters lines from a file/stream on demand, as a function like readdlm() asks for them?
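
The laziness itself is easy to express with an iterator, e.g. (just a sketch of what I mean, with the same 3333 predicate):

datarows = Iterators.filter(!startswith("3333"), eachline("out.dat"))

The problem is that readdlm() wants an IO (or a filename), not an iterator of lines.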

For fixed-width tables, you could try FixedWidthTables.jl; specifically, its skiprows_startwith argument.

But a generic and efficient mechanism to filter/modify lines in an IO stream would be nice indeed!..

If you are already looping over the file, why not directly parse the lines you need?

(PS: I wish CSV.jl had filtering options for this kind of requirement.)
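
For example, something along these lines (an untested sketch; parsefile is just a placeholder name, and it assumes whitespace-separated Float64 columns with the 3333 comment prefix from your question):

function parsefile(filename; comment = "3333")
    rows = Vector{Vector{Float64}}()
    for line in eachline(filename)
        startswith(line, comment) && continue     # skip comment lines
        push!(rows, parse.(Float64, split(line))) # parse one data row
    end
    return reduce(vcat, permutedims.(rows))       # stack row vectors into a matrix
end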

Have you tried Iterators.filter together with CSV.Rows as in this solution?

Or TableOperations.filter as suggested further down in the same thread.
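
Untested, but roughly in this direction (assuming whitespace-delimited columns with repeated spaces; the keyword arguments would need adjusting for the actual file):

using CSV

datarows = Iterators.filter(row -> !startswith(string(row[1]), "3333"),
                            CSV.Rows("out.dat"; header = false, delim = ' ', ignorerepeated = true))

TableOperations.filter works similarly, but keeps the result a Tables.jl-compatible table, so it can be materialized directly, e.g. into a DataFrame.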

Tried something like that now, thanks. For the moment, though, the best I could get was with eachsplit. The issue with CSV.Rows is that any problems (type instabilities, for example) then occur inside the CSV machinery, which is much harder to debug. But something will come out of this: I'm getting close to the performance of reading the CSV directly into a DataFrame, when I adjust the file to allow that.
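
Roughly what the eachsplit version looks like (a sketch; readmatrix is my placeholder name, it needs Julia ≥ 1.8 for eachsplit, and it assumes all data rows have the same number of columns):

function readmatrix(filename; comment = "3333")
    vals = Float64[]
    ncols = 0
    for line in eachline(filename)
        startswith(line, comment) && continue # skip comment lines
        ncols = 0
        for tok in eachsplit(line)            # lazily split on whitespace
            push!(vals, parse(Float64, tok))
            ncols += 1
        end
    end
    return permutedims(reshape(vals, ncols, :)) # rows were appended column-major
end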

Abhro:

When you have this finally figured out, could you post an MWE that can process a simple text file containing integers and comments, like this one:

4
6
# This is a comment
  78
     # This is also a comment
 8
     12  

This you can do with DelimitedFiles.readdlm via its comments and comment_char keywords (note that comment_char is a single Char, so this handles the # comments here, but not multi-character prefixes like the 3333 from the original question):

julia> data = """
       4
       6
       # This is a comment
         78
            # This is also a comment
        8
            12
       """
"4\n6\n# This is a comment\n  78\n     # This is also a comment\n 8\n     12\n"

julia> readdlm(IOBuffer(data), Int; comments=true, comment_char='#')
5×1 Matrix{Int64}:
  4
  6
 78
  8
 12