Processing multiple large zipped csv files

thedacheng · April 3, 2022, 5:26pm

I am trying to process multiple large zipped CSV files. Memory consumption is my main concern at the moment. It seems CSV.File always loads everything into memory, so I am using CSV.Rows to iterate through the data instead. I ran into an issue: CSV.Rows seems to hold about double the file size in memory if the data source is a zipped file, but does not have this problem if the source is the uncompressed text file. https://github.com/JuliaData/CSV.jl/issues/997

Also, is there any documentation on writing a custom sink for Tables.jl? I have only been able to find Home · DataStreams.jl, which seems to be quite outdated.

Thank you very much.

nalimilan · April 3, 2022, 5:51pm

Note that CSV.File uses mmap, so if you don’t have enough RAM to load the complete file, the OS should take care of loading only the necessary parts at a given time. So a good solution can be to unzip to .csv files on disk and read those using CSV.File (this is done automatically for gzip-compressed files). You can also do that with CSV.Rows of course. I’m not sure it’s possible to parse compressed CSV files without copying them to memory first, but you could try with CodecZlib.jl (JuliaIO/CodecZlib.jl on GitHub, zlib codecs for TranscodingStreams.jl).
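A minimal sketch of the "unzip to disk, then read the .csv" suggestion, assuming the ZipFile.jl package used elsewhere in this thread. The helper names extract_to_tempfile and foreach_row are made up for illustration and are not part of CSV.jl or ZipFile.jl, and I have not measured how much memory this actually saves:

```julia
using CSV, ZipFile

# Hypothetical helper: copy one zip entry to a temporary file in fixed-size
# chunks, so the uncompressed data is never held in memory all at once.
function extract_to_tempfile(entry; chunk_bytes::Integer = 1 << 20)
    path, io = mktemp()
    try
        while !eof(entry)
            write(io, read(entry, chunk_bytes))  # read at most chunk_bytes bytes
        end
    finally
        close(io)
    end
    return path
end

# Hypothetical driver: call `f` on every row of every entry in the archive.
# Keyword arguments (e.g. `types`) are passed through to CSV.Rows.
function foreach_row(f, zip_path; kwargs...)
    archive = ZipFile.Reader(zip_path)
    try
        for entry in archive.files
            csv_path = extract_to_tempfile(entry)
            try
                for row in CSV.Rows(csv_path; kwargs...)
                    f(row)
                end
            finally
                # Assumption: the file can be deleted right away. On Windows
                # this may fail while the file is still memory-mapped.
                rm(csv_path; force = true)
            end
        end
    finally
        close(archive)
    end
end
```

For example, counting the rows in an archive would look like `n = Ref(0); foreach_row(_ -> n[] += 1, "test.zip")`.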

DataStreams.jl is a different (and older) package from Tables.jl. See the Tables.jl documentation at Home · Tables.jl.
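As far as I understand the Tables.jl documentation, a "sink" does not need a special type: it can be any function that accepts a Tables.jl-compatible source and consumes it through Tables.rows or Tables.columns. Here is a small sketch of that idea; column_sum is a hypothetical name chosen for illustration, not a Tables.jl function:

```julia
using Tables

# Hypothetical sink: accumulate the sum of one column from any
# Tables.jl-compatible source, one row at a time.
function column_sum(source, col::Symbol)
    total = 0.0
    for row in Tables.rows(source)
        value = Tables.getcolumn(row, col)
        ismissing(value) || (total += value)  # skip missing values
    end
    return total
end

# A Vector of NamedTuples is already a valid row table.
table = [(id = 1, price = 2.5), (id = 2, price = 4.0)]
column_sum(table, :price)
```

The same function should accept a CSV.Rows object as the source, provided the column is parsed as a number (for example via the types keyword), since CSV.Rows yields strings by default.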

thedacheng · April 4, 2022, 3:38am

I am following the example here for reading a CSV with ZipFile. I also traced through the CSV.jl code in the debugger. It seems the input zip file is unzipped by ZipFile’s reader, the content is written to a temp file, and that file is then mmap’ed by CSV.jl. However, that part of the code holds memory which I believe should be freed. Here’s what I shared in the GitHub issue:

Hi, I am trying to use CSV.Rows to iterate through a zipped text file which is about 3.4GB uncompressed. It seems CSV.Rows is holding a large chunk of memory (about double the size of the file), which defeats the purpose.
If I load the unzipped text file then I don’t see this problem. It seems this line of code in the buffer_to_tempfile function in utils.jl allocates memory which isn’t freed. I tried setting stream and output to nothing, and then the program holds memory about the size of the file (instead of double).

I am not sure if this is a problem with how I am using ZipFile and CSV.

using CSV, ZipFile
z = ZipFile.Reader("test.zip")
r = z.files[1]
a = CSV.Rows(r; types=[Int64, String, Float64, Float64, Int8, Float64, Float64, Int64])
for c in a
    # custom aggregation code on c 
end

Is there a recommended way to write a Julia program that handles this kind of data-streaming process?
