I have a fixed-width output file from a Fortran program in a well-defined matrix format, except for occasional lines that I can regard as comments. I would like to pass the file to something like DelimitedFiles.readdlm() without it trying to parse the comment rows.
Currently, my solution is to read the whole file into memory, create an in-memory stream buffer, write the non-comment lines to that stream, and then let readdlm() parse the buffer.
using DelimitedFiles

datarow_predicate = !startswith("3333")  # comment lines start with 3333

function filterstream(filename::AbstractString; predicate = datarow_predicate)
    filteredstream = IOBuffer()
    infilestream = open(filename)  # |> GzipDecompressorStream
    for line in eachline(infilestream)
        if predicate(line)
            write(filteredstream, line)
            write(filteredstream, "\n")
        end
    end
    close(infilestream)
    seekstart(filteredstream)
    return filteredstream
end

example_matrix = readdlm(filterstream("out.dat"))
But this seems like a bad idea, because it reads everything into memory. In Python, a better approach would be a generator function that yields each line that passes the test. Is there a way to do something similar in Julia? That is, some sort of function/technique that reads and filters lines from a file/stream on demand, as a function like readdlm() asks for them?
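One generator-like option is to combine eachline, which yields lines from a file lazily, with Iterators.filter. Since readdlm itself wants an IO or a filename, this sketch parses the rows directly instead of going through readdlm; lazy_rows, as_matrix, and the "3333" marker are illustrative names for this example, not existing functions:

```julia
# Sketch: lazily filter and parse rows without buffering the whole file.
# Requires Julia >= 1.5 for the curried startswith("3333").

# A generator that yields one parsed row per non-comment line, on demand.
lazy_rows(filename; predicate = !startswith("3333")) =
    (parse.(Float64, split(line))
     for line in Iterators.filter(predicate, eachline(filename)))

# Materialize into a matrix only at the end, one row at a time.
as_matrix(rows) = reduce(vcat, (permutedims(r) for r in rows))
```

Only one line is held in memory at a time; the trade-off is re-implementing the whitespace field parsing that readdlm would otherwise do for you.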
Tried something like that now, thanks. For the moment, the best I could obtain was using eachsplit. The issue with CSV.Rows is that any problems (type instabilities, for example) then occur inside the CSV machinery, which makes them much harder to debug. But something will come out of that; I'm getting close to the performance of reading directly into a DataFrame with CSV when I adjust the file to allow it.
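For comparison, a minimal sketch of the eachsplit route mentioned above (Julia 1.8 or later; parse_rows and the "3333" marker are illustrative, not from any package):

```julia
# Sketch of the eachsplit approach: iterate the fields of each
# non-comment line without allocating an intermediate array of
# substrings per line. (Requires Julia >= 1.8 for eachsplit.)
function parse_rows(io::IO; comment = "3333")
    rows = Vector{Float64}[]
    for line in eachline(io)
        startswith(line, comment) && continue  # skip comment lines
        push!(rows, [parse(Float64, f) for f in eachsplit(line)])
    end
    return rows
end
```

eachsplit yields the fields of a line lazily instead of building a vector of substrings up front, which is where most of the win over split comes from.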
julia> data = """
4
6
# This is a comment
78
# This is also a comment
8
12
"""
"4\n6\n# This is a comment\n78\n# This is also a comment\n8\n12\n"
julia> readdlm(IOBuffer(data), Int; comments=true, comment_char='#')
5×1 Matrix{Int64}:
4
6
78
8
12