Streaming Tables to CSV

How can I stream rows to a CSV file? That is, I have an iterator that produces rows, and I want those rows written to a file without requiring enough memory to hold all of them at once.

It seems like the CSV.jl library is designed to write a stream of data from an in-memory table, but not to stream data from a table that is itself produced incrementally:

CSV.RowWriter(table; kwargs...)

This table must implement the Tables.jl interface, which seems to be a low-level interface package implemented primarily by DataFrames.jl. I can’t find a way to make a DataFrame that is generated dynamically row by row, such as when the data is too large to fit in memory.

https://csv.juliadata.org/stable/writing.html

Is a loop that calls print too slow?
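
For reference, such a loop is only a few lines (a sketch; note it does no quoting or escaping of delimiters inside values, and "out.csv" is a placeholder path):

open("out.csv", "w") do io           # "out.csv" is a placeholder path
    println(io, "col1,col2")         # header line
    for n in 1:100                   # stands in for any row iterator
        println(io, "hi $n,", n^2)   # each row is written immediately, nothing buffered
    end
end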

My impression is that a lot of the optimizations in the CSV package would be inapplicable to writing row by row, so another alternative would be to write the data in chunks: save a few thousand rows to an in-memory buffer (e.g. a vector of rows or a small DataFrame), call CSV.write, then move on to the next chunk, re-using the buffer each time. A sketch follows below.
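
A sketch of that chunked approach, assuming a hypothetical produce_row function standing in for the real row iterator, and using CSV.write's append keyword so the header is written only for the first chunk (for simplicity this allocates a fresh buffer per chunk rather than re-using one):

using CSV

# Hypothetical row producer; stands in for any iterator of NamedTuples.
produce_row(n) = (col1 = "hi $n", col2 = n^2)

function write_in_chunks(path, rows; chunksize = 10_000)
    for (i, chunk) in enumerate(Iterators.partition(rows, chunksize))
        rowvec = [r for r in chunk]              # concretely typed buffer, one chunk at a time
        CSV.write(path, rowvec; append = i > 1)  # append after the first chunk skips the header
    end
end

write_in_chunks("chunked.csv", (produce_row(n) for n in 1:1_000_000))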

Calling print isn’t too slow — I just want to leverage the CSV library so I don’t have to write my own escaping/delimiter logic.

I guess I can implement chunking; it just seems like it adds complexity to what I’m doing.

The writedlm function in the DelimitedFiles standard library accepts “an iterable collection of iterable rows”.

The main reasons to use CSV.jl instead of DelimitedFiles are performance and Tables.jl support. Since it seems you need neither of these, I would just use DelimitedFiles.
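
For example (a sketch; writedlm defaults to tab delimiters, so pass ',' explicitly for CSV output):

using DelimitedFiles

rows = ((n, "hi $n", n^2) for n in 1:100)   # lazy iterator of iterable rows
writedlm("out.csv", rows, ',')              # pass ','; the default delimiter is '\t'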


An iterator of named tuples would be a valid (and lazy) Tables.jl table:

get_row(n) = (; col1="hi $n", col2=n^2)
table = (get_row(n) for n=1:100)

Definitely not used just by DataFrames!
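
Since CSV.write accepts any Tables.jl-compatible source, such a generator can be passed to it directly, so the full table should never need to be materialized (a minimal sketch; "lazy.csv" is a placeholder path):

using CSV

get_row(n) = (; col1 = "hi $n", col2 = n^2)
table = (get_row(n) for n in 1:100)   # lazy: each row is produced on demand

CSV.write("lazy.csv", table)          # rows are written as they are generated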


Thanks! I think @stevengj's suggestion to use writedlm will definitely work for what I’m doing, but this is exactly what I was asking about.


A better solution is to wrap the DataFrame in eachrow() when passing it to either writedlm() or CSV.write():

using DelimitedFiles, CSV, DataFrames

# first with writedlm from DelimitedFiles (pass ','; the default delimiter is '\t')
writedlm("path_to_save_folder/file_name.csv", eachrow(dataframe), ',')
# now with CSV.write
CSV.write("path_to_save_folder/file_name.csv", eachrow(dataframe))

Both work well, but I saw a sudden spike in memory usage with writedlm() that did not happen with CSV.write().