Processing multiple large zipped CSV files


I am trying to process multiple large zipped CSV files, and memory consumption is my main concern at the moment. It seems CSV.File always loads everything into memory, so I am using CSV.Rows to iterate through the data instead. However, CSV.Rows seems to hold about double the file size in memory when the data source is a zipped file, and does not have this problem when the source is the uncompressed text file. See: Memory Consumption of CSV.Rows with ZipFile · Issue #997 · JuliaData/CSV.jl · GitHub

Also, is there any documentation on writing a custom sink for Tables.jl? I am only able to find Home · DataStreams.jl, which seems to be quite outdated.

Thank you very much.

Note that CSV.File uses mmap, so if you don’t have enough RAM to load the complete file, the OS should take care of loading only the necessary parts at a given time. So a good solution can be to unzip to .csv files on disk and read those using CSV.File (this is done automatically for gzip-compressed files). You can also do that with CSV.Rows, of course. I’m not sure it’s possible to parse compressed CSV files without copying them to memory first; you could try with CodecZlib (GitHub - JuliaIO/CodecZlib.jl: zlib codecs for TranscodingStreams.jl.).
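A minimal sketch of the unzip-to-disk approach described above, assuming ZipFile.jl is used to read the archive. The helper names `extract_entry` and `sum_value_column`, the `value` column, and the chunk size are hypothetical placeholders, not part of CSV.jl or ZipFile.jl. The entry is copied to disk in fixed-size chunks, and the resulting plain file is then iterated with CSV.Rows:

```julia
using CSV, ZipFile

# Hypothetical helper: copy one archive entry to `out_path` in fixed-size chunks,
# so that only about `chunk_size` bytes of decompressed data are buffered at a time.
function extract_entry(zip_path::AbstractString, entry_name::AbstractString,
                       out_path::AbstractString; chunk_size::Int = 1 << 20)
    z = ZipFile.Reader(zip_path)
    try
        idx = findfirst(f -> f.name == entry_name, z.files)
        idx === nothing && error("entry $entry_name not found in $zip_path")
        entry = z.files[idx]
        remaining = Int(entry.uncompressedsize)
        buf = Vector{UInt8}(undef, chunk_size)
        open(out_path, "w") do out
            while remaining > 0
                n = min(remaining, chunk_size)
                resize!(buf, n)      # only shrinks on the final, partial chunk
                read!(entry, buf)    # fills `buf` with exactly n bytes
                write(out, buf)
                remaining -= n
            end
        end
    finally
        close(z)
    end
    return out_path
end

# Iterate the extracted file row by row and aggregate one column.
# Assumes a column named `value` that contains no missing entries.
function sum_value_column(csv_path::AbstractString, types)
    total = 0.0
    for row in CSV.Rows(csv_path; types = types)
        total += row.value
    end
    return total
end
```

For example, `sum_value_column(extract_entry("data.zip", "data.csv", tempname()), [Int64, Float64])` would extract a (hypothetical) two-column entry and sum its second column. Whether this keeps peak memory low enough for a 3.4GB file would need to be measured.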

DataStreams.jl is a different (and older) package from Tables.jl. See the Tables.jl documentation at Home · Tables.jl.
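As a rough illustration of what a sink can look like in Tables.jl: it is essentially any function that consumes a source through `Tables.rows` (or `Tables.columns`). The sketch below assumes all columns are numeric; `column_sums` is a hypothetical name chosen for illustration, not an API of Tables.jl:

```julia
using Tables

# Hypothetical sink: consumes any Tables.jl-compatible source row by row
# and accumulates a running sum per column (assumes numeric columns).
function column_sums(source)
    sums = Dict{Symbol, Float64}()
    for row in Tables.rows(source)
        for name in Tables.columnnames(row)
            sums[name] = get(sums, name, 0.0) + Tables.getcolumn(row, name)
        end
    end
    return sums
end

# A NamedTuple of vectors is a valid Tables.jl column table.
table = (a = [1, 2, 3], b = [0.5, 1.5, 2.5])
column_sums(table)
```

Because the function only relies on the row interface, the same sink should accept other Tables.jl sources, such as a CSV.Rows iterator with numeric column types.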

I am following the example here for reading a CSV from a ZipFile. I also traced into the CSV.jl code in the debugger. It seems the input zip file is unzipped by ZipFile’s reader, the content is written to a temp file, and that file is mmap’ed by CSV.jl. However, that part of the code holds memory which I believe should be freed. Here’s what I shared in the GitHub issue:

Hi, I am trying to use CSV.Rows to iterate through a zipped text file which is about 3.4GB uncompressed. It seems CSV.Rows is holding a large chunk of memory (about double the size of the file), which defeats the purpose.
If I load the unzipped text file, I don’t see this problem. It seems this line of code in the buffer_to_tempfile function in utils.jl allocates memory which isn’t freed. I tried setting stream and output to nothing, and the program then holds memory of about the size of the file (instead of double).

I am not sure if this is a problem with how I am using ZipFile and CSV.

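using CSV, ZipFile
z = ZipFile.Reader("data.zip")  # assumed: hypothetical archive name, the post does not show how r was created
r = z.files[1]                  # assumed: the CSV is the first entry in the archive
a = CSV.Rows(r; types=[Int64, String, Float64, Float64, Int8, Float64, Float64, Int64])
for c in a
    # custom aggregation code on c
end
close(z)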

Is there a recommended way to write a Julia program to handle this kind of data-streaming process?