Streaming Tables to CSV

How can I stream some rows to a CSV file (i.e. I have an iterator that creates rows, and I want those rows to be written to a file without requiring enough memory for all of the rows).

It seems like the CSV.jl library is designed to write a stream of data from an in-memory table, but it is not capable of streaming data from a streaming table:

CSV.RowWriter(table; kwargs...)

This table must implement Tables.jl interface, which seems like a low level interface library primarily implemented by DataFrames.jl. I can’t find a way to make a DataFrame that is generated dynamically row by row, such as when the data size is too large to fit in memory.

https://csv.juliadata.org/stable/writing.html

Is a loop that calls print too slow?

My impression is that a lot of the optimizations in the CSV package would be inapplicable to writing row by row, so another alternative would be to write the data in chunks: save a few thousand rows to an array or a DataTable and call CSV.write, then move on to the next chunk (re-using/over-writing the same memory buffer for every chunk).

Calling print isn’t too slow — I just want to leverage the CSV library so I don’t have to write my own escaping/delimiter logic.

I guess I can implement chunking, it just seems like it increases the complexity of what I’m doing.

The writedlm function in the DelimitedFiles standard library accepts “an iterable collection of iterable rows”.

The main reasons to use CSV.jl instead of DelimitedFiles are performance and Tables.jl support. Since it seems you need neither of these, I would just use DelimitedFiles.

3 Likes

An iterator of named tuples would be a valid (and lazy) Tables.jl table:

get_row(n) = (; col1="hi $n", col2=n^2)
table = (get_row(n) for n=1:100)

Definitely not used just by DataFrames!

4 Likes

Thanks! I think @stevengj 's suggestion to use writedlm definitely work for what I’m doing, but this is exactly what I was asking about.

1 Like

A better solution is to pass the dataframe to be written to the function eachrow() in the argument of either writedlm() or CSV.write():

# first with writedlm from DelimitedFiles
writedlm("path_to_save_folder/file_name.csv", eachrow(dataframe))
# now with CSV.write 
CSV.write("path_to_save_folder/file_name.csv", eachrow(dataframe))

Both work well but I experienced a sudden overshoot in memory for writedlm(), which did not happen for CSV.write().