Efficiently filter rows while reading very large CSV-ish file

I have a text file representing k-mer counts. It's on the order of a billion lines long and 14 GB gzipped. It looks like this (first 4 rows shown via zless):

AAAAAAAAAAAAAAAAAAAAGATTTTAAATATGCAAATTT 6 1 3 0 2 1 2 1 0 0 1
AAAAAAAAAAAAAAAAAAAAGTTAAAAAAAAAAAAAAAAA 0 2 1 1 0 0 0 0 0 0 1
AAAAAAAAAAAAAAAAAAAAGTTAAAAAAAAAGTTAAAAA 1 5 2 0 10 0 0 0 6 0 0
AAAAAAAAAAAAAAAAAAAAGTTCAGGTTGTTAAAAAAAC 0 1 9 0 46 0 3 0 16 0 0

My goal is to efficiently create a DataFrame from the small fraction (~1%) of rows that pass some boolean filter. From the CSV.jl docs I thought CSV.Rows would be an efficient way to iterate row-by-row, seeing each row exactly once and collecting the rows to keep.
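Roughly what I had in mind was something like this sketch, where keep_counts stands in for my real filter and the 12-column layout matches the sample above:

using CSV, DataFrames

# Stand-in filter: keep rows whose counts sum to at least some threshold.
keep_counts(counts) = sum(counts) >= 20

kmers = String[]
kept_counts = Vector{Int}[]
for row in CSV.Rows("counts_big.txt.gz"; header = false, delim = ' ')
    # CSV.Rows leaves values as strings by default, so parse the 11 count columns.
    c = [parse(Int, row[i]) for i in 2:12]
    if keep_counts(c)
        push!(kmers, String(row[1]))
        push!(kept_counts, c)
    end
end
df = DataFrame(kmer = kmers, counts = kept_counts)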

But I can’t get even a single line out of my big file:
CSV.Rows("counts_big.txt.gz", header = false, limit = 1, delim = ' ')
It ran for several minutes until I got a macOS popup warning that my hard drive was full, so I killed it. (Iterating over a 10,000-line gzipped subset works just fine.)

Am I misusing CSV.Rows, or is another tool more appropriate here?

Thanks in advance for the help!


You could try using CodecZlib.jl (https://github.com/JuliaIO/CodecZlib.jl), which provides zlib codecs for TranscodingStreams.jl:

using CodecZlib

stream = GzipDecompressorStream(open("counts_big.txt.gz"; read=true))
try
    for line in eachline(stream)
        # Do something with the line, 
        # like filtering and pushing to some vector.
    end
finally
    close(stream)
end
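
For example, here's a rough sketch of filtering into a DataFrame along those lines (the sum-based filter and the file name are placeholders for your actual condition):

using CodecZlib, DataFrames

kmers = String[]
kept_counts = Vector{Int}[]
stream = GzipDecompressorStream(open("counts_big.txt.gz"; read=true))
try
    for line in eachline(stream)
        fields = split(line, ' ')
        counts = parse.(Int, fields[2:end])
        # Placeholder filter: keep rows whose counts sum to at least 20.
        if sum(counts) >= 20
            push!(kmers, String(fields[1]))
            push!(kept_counts, counts)
        end
    end
finally
    close(stream)
end
df = DataFrame(kmer = kmers, counts = kept_counts)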

Yeah, that may be the right answer. I think CSV.jl uses CodecZlib under the hood to read gzip-compressed files, but I was hoping to leverage CSV.jl's existing infrastructure for parsing, type conversion, efficient memory management, etc.
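
For instance, I wondered whether handing CSV.Rows an already-decompressed stream would let me keep that machinery (the CSV.jl docs say an IO source is accepted), though I haven't confirmed it avoids whatever was filling my disk:

using CSV, CodecZlib

stream = GzipDecompressorStream(open("counts_big.txt.gz"; read=true))
for row in CSV.Rows(stream; header = false, delim = ' ')
    # same filtering/collecting as above
end
close(stream)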

Have you looked at this solution?

That's the flavor of what I've tried, but the linked solution operates on an uncompressed file, whereas I'm not able to read even a single line from my big gzip-compressed file with CSV.Rows.

Maybe this will work for you if you're able to add the WHERE condition. If you want to sample 1% of the data, you can add USING SAMPLE 1% to the end of the query, as in the second snippet below.

using DuckDB
db = DuckDB.DB()
x = DBInterface.execute(db, "select * from read_csv('text.csv.gz', compression='gzip', header=false, sep=' ') where ???")
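
For example, here is the sampling variant as a self-contained snippet, materialized into a DataFrame (the file name is a placeholder, and DBInterface.execute is the documented way to run queries with DuckDB.jl):

using DuckDB, DataFrames

db = DuckDB.DB()
res = DBInterface.execute(db, "select * from read_csv('counts_big.txt.gz', compression='gzip', header=false, delim=' ') using sample 1%")
df = DataFrame(res)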