Streaming Tables to CSV

oconnore · May 10, 2022, 10:47pm

How can I stream some rows to a CSV file (i.e. I have an iterator that creates rows, and I want those rows to be written to a file without requiring enough memory for all of the rows).

It seems like the CSV.jl library is designed to write a stream of data from an in-memory table, but it is not capable of streaming data from a streaming table:

CSV.RowWriter(table; kwargs...)

This table must implement Tables.jl interface, which seems like a low level interface library primarily implemented by DataFrames.jl. I can’t find a way to make a DataFrame that is generated dynamically row by row, such as when the data size is too large to fit in memory.

https://csv.juliadata.org/stable/writing.html

stevengj · May 10, 2022, 11:25pm

Is a loop that calls print too slow?

My impression is that a lot of the optimizations in the CSV package would be inapplicable to writing row by row, so another alternative would be to write the data in chunks: save a few thousand rows to an array or a DataTable and call CSV.write, then move on to the next chunk (re-using/over-writing the same memory buffer for every chunk).

oconnore · May 10, 2022, 11:32pm

Calling print isn’t too slow — I just want to leverage the CSV library so I don’t have to write my own escaping/delimiter logic.

I guess I can implement chunking, it just seems like it increases the complexity of what I’m doing.

stevengj · May 11, 2022, 12:02am

The writedlm function in the DelimitedFiles standard library accepts “an iterable collection of iterable rows”.

The main reasons to use CSV.jl instead of DelimitedFiles are performance and Tables.jl support. Since it seems you need neither of these, I would just use DelimitedFiles.

ericphanson · May 11, 2022, 12:20am

An iterator of named tuples would be a valid (and lazy) Tables.jl table:

get_row(n) = (; col1="hi $n", col2=n^2)
table = (get_row(n) for n=1:100)

Definitely not used just by DataFrames!

oconnore · May 11, 2022, 7:05pm

Thanks! I think @stevengj 's suggestion to use writedlm definitely work for what I’m doing, but this is exactly what I was asking about.

alvarodiaz · April 25, 2023, 1:15pm

A better solution is to pass the dataframe to be written to the function eachrow() in the argument of either writedlm() or CSV.write():

# first with writedlm from DelimitedFiles
writedlm("path_to_save_folder/file_name.csv", eachrow(dataframe))
# now with CSV.write 
CSV.write("path_to_save_folder/file_name.csv", eachrow(dataframe))

Both work well but I experienced a sudden overshoot in memory for writedlm(), which did not happen for CSV.write().

Topic		Replies	Views
JuliaDB - Saving to CSV New to Julia question	11	5205	June 18, 2019
CSV.jl: Write csv row by row General Usage	3	1149	June 18, 2020
Is it possible to iterate over a very large CSV in Windows? New to Julia	9	1002	August 23, 2019
How to use CSVFile as iterable? New to Julia	3	644	August 26, 2019
Partition a large CSV file into smaller files without loading into memory General Usage question , csv , io	6	3712	March 10, 2019

Streaming Tables to CSV

Related topics