Efficiently filter rows while reading very large CSV-ish file

I have a text file representing k-mer counts. It's on the order of a billion lines long and 14 GB gzipped. It looks like this (first 4 rows shown via zless):

AAAAAAAAAAAAAAAAAAAAGATTTTAAATATGCAAATTT 6 1 3 0 2 1 2 1 0 0 1
AAAAAAAAAAAAAAAAAAAAGTTAAAAAAAAAAAAAAAAA 0 2 1 1 0 0 0 0 0 0 1
AAAAAAAAAAAAAAAAAAAAGTTAAAAAAAAAGTTAAAAA 1 5 2 0 10 0 0 0 6 0 0
AAAAAAAAAAAAAAAAAAAAGTTCAGGTTGTTAAAAAAAC 0 1 9 0 46 0 3 0 16 0 0

My goal is to efficiently create a DataFrame from the small fraction (~1%) of rows that pass some boolean filter. From the CSV.jl docs I thought CSV.Rows would be an efficient way to iterate row-by-row, seeing each row exactly once and collecting the rows to keep.
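Roughly what I had in mind was something like this sketch, where keep_counts stands in for my real filter and the 12-column layout matches the sample above:

using CSV, DataFrames

# Stand-in filter: keep rows whose counts sum to at least some threshold.
keep_counts(counts) = sum(counts) >= 20

kmers = String[]
kept_counts = Vector{Int}[]
for row in CSV.Rows("counts_big.txt.gz"; header = false, delim = ' ')
    # CSV.Rows leaves values as strings by default, so parse the 11 count columns.
    c = [parse(Int, row[i]) for i in 2:12]
    if keep_counts(c)
        push!(kmers, String(row[1]))
        push!(kept_counts, c)
    end
end
df = DataFrame(kmer = kmers, counts = kept_counts)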

But I can’t get even a single line out of my big file:
CSV.Rows("counts_big.txt.gz", header = false, limit = 1, delim = ' ')
It ran for several minutes until I got a macOS popup warning that my hard drive was full, so I killed it. (Iterating over a 10,000-line gzipped subset works just fine.)

Am I misusing CSV.Rows, or is another tool more appropriate here?

Thanks in advance for the help!


You could try using CodecZlib.jl (https://github.com/JuliaIO/CodecZlib.jl), which provides zlib codecs for TranscodingStreams.jl:

using CodecZlib

stream = GzipDecompressorStream(open("counts_big.txt.gz"; read=true))
try
    for line in eachline(stream)
        # Do something with the line, 
        # like filtering and pushing to some vector.
    end
finally
    close(stream)
end
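
For example, here's a rough sketch of filtering into a DataFrame along those lines (the sum-based filter and the file name are placeholders for your actual condition):

using CodecZlib, DataFrames

kmers = String[]
kept_counts = Vector{Int}[]
stream = GzipDecompressorStream(open("counts_big.txt.gz"; read=true))
try
    for line in eachline(stream)
        fields = split(line, ' ')
        counts = parse.(Int, fields[2:end])
        # Placeholder filter: keep rows whose counts sum to at least 20.
        if sum(counts) >= 20
            push!(kmers, String(fields[1]))
            push!(kept_counts, counts)
        end
    end
finally
    close(stream)
end
df = DataFrame(kmer = kmers, counts = kept_counts)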

Yeah, that may be the right answer. I think CSV.jl uses CodecZlib under the hood to read gzip-compressed files, but I was hoping to leverage CSV.jl's existing infrastructure for parsing, type conversion, efficient memory management, etc.
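
For instance, I wondered whether handing CSV.Rows an already-decompressed stream would let me keep that machinery (the CSV.jl docs say an IO source is accepted), though I haven't confirmed it avoids whatever was filling my disk:

using CSV, CodecZlib

stream = GzipDecompressorStream(open("counts_big.txt.gz"; read=true))
for row in CSV.Rows(stream; header = false, delim = ' ')
    # same filtering/collecting as above
end
close(stream)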

Have you looked at this solution?

That's the flavor of what I've tried, but the linked solution operates on an uncompressed file, whereas I'm not able to read even a single line from my big gzip-compressed file with CSV.Rows.

Maybe this will work for you if you're able to add the WHERE condition. If you want to sample 1% of the data, you can add USING SAMPLE 1% to the end of the query, as in the second snippet below.

using DuckDB
db = DuckDB.DB()
x = DBInterface.execute(db, "select * from read_csv('text.csv.gz', compression='gzip', header=false, sep=' ') where ???")
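
For example, here is the sampling variant as a self-contained snippet, materialized into a DataFrame (the file name is a placeholder, and DBInterface.execute is the documented way to run queries with DuckDB.jl):

using DuckDB, DataFrames

db = DuckDB.DB()
res = DBInterface.execute(db, "select * from read_csv('counts_big.txt.gz', compression='gzip', header=false, delim=' ') using sample 1%")
df = DataFrame(res)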