Export very large dataframe

Hi guys,

Could someone please tell me how to export a 6 GB dataframe with 200 million rows and 4 columns? I am using the CSV.write function of the CSV package, but it exports only part of the table. I think it is because the dataframe is too big. Any ideas on how to deal with it? I guess that what I need to do is open a file and write to it piece by piece, but I do not know how to do that with a dataframe. I really appreciate any help.

Thank you

Boris

Have you tried the compress keyword argument of CSV.write?

Hi @rafael.guerra,
Thank you for your reply. I do not see any compress keyword argument in the documentation and compress = true produces an error. Do you have more information?

I use Arrow.jl to store very large DataFrames (500 million+ rows). The Arrow record batches feature is also pretty useful for processing very large data sets.
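A minimal sketch of what this could look like, assuming Arrow.jl, DataFrames.jl and Tables.jl are installed. The file name, the chunk size, the helper `count_rows`, and the small stand-in table are all made up for illustration; it has not been tried on a 200-million-row table.

```julia
using Arrow, DataFrames, Tables

# Small stand-in for the real table (hypothetical data)
df = DataFrame(rand(1_000, 4), :auto)

# Split the row indices into blocks; each block is written as one record batch
chunk = 250                                   # hypothetical chunk size
ranges = Iterators.partition(1:nrow(df), chunk)
parts = Tables.partitioner(r -> view(df, r, :), ranges)

path = joinpath(mktempdir(), "df.arrow")      # hypothetical output location
Arrow.write(path, parts)

# Read the file back one record batch at a time and count the rows,
# so the whole table does not have to be materialized at once
count_rows(p) = sum(nrow(DataFrame(batch)) for batch in Arrow.Stream(p))

count_rows(path) == nrow(df)                  # sanity check that nothing was dropped
```

To load everything in one go instead, `DataFrame(Arrow.Table(path))` should work on the same file.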

If you are not concerned about file stability across Julia versions, you can try JDF.jl:

using JDF
JDF.save("path/to/file.jdf", df)

# to load it back
df_copy = JDF.load("path/to/file.jdf")

This is a bug. I'm sure CSV.jl is written to export the whole thing, but maybe it is encountering something that makes it abort.

Compression is supported in CSV.jl v0.9.

The CSV.jl documentation covers this in the writing section.

I tested the code below on Windows 10 with Julia 1.7, for a dataframe with 200 million rows and 4 Float64 columns.

In standard CSV writing mode (no compression), it took 2-3 min and 15 GB of disk space.

However, compression seems to be very slow for this kind of large random input data. 7-Zip is just as slow, for instance. Compression ran for more than 20 min before I ran out of patience and aborted it, and it might achieve only ~50% compression (to be confirmed).

NB: for a smaller dataframe of random data, CSV.jl with gzip achieved a bit better than 50% compression.

using CSV, DataFrames

nr = 200_000_000;  nc = 4
df = DataFrame(rand(nr,nc), :auto)
CSV.write("df_200M_x_4.csv", df)  #no compression: 2-3 min, 15 GB file
CSV.write("df_200M_x_4.gzip", df, compress=true)  # > 20 min, ~50% compression? Tbc

Thank you @rafael.guerra,
I updated Julia to 1.7 and updated the CSV package. After that, a much larger file was exported, but it still did not have all the rows. Maybe there is a resource limitation, as I am working on a laptop?
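One way to narrow this down might be to write the table in blocks and then count the lines in the output file. The sketch below assumes CSV.jl's `append` keyword (which skips the header by default when appending). The helper `write_in_chunks`, the chunk size, and the small stand-in table are hypothetical. It does not explain why the export stops early, but it shows whether rows are missing. With an uncompressed file in the region of 15 GB, as in the test above, free disk space on the laptop is also worth checking.

```julia
using CSV, DataFrames

# Write `df` to `path` in blocks of `chunk` rows.
# Hypothetical helper, not part of CSV.jl.
function write_in_chunks(path, df; chunk = 10_000_000)
    for (i, rows) in enumerate(Iterators.partition(1:nrow(df), chunk))
        # The first block creates the file with a header; later blocks are appended
        CSV.write(path, view(df, rows, :); append = i > 1)
    end
    return path
end

# Small stand-in for the real table (hypothetical data)
df = DataFrame(rand(1_000, 4), :auto)
path = write_in_chunks(joinpath(mktempdir(), "df.csv"), df; chunk = 300)

# A complete export has one header line plus one line per row
countlines(path) == nrow(df) + 1
```

If the comparison on the last line comes out false for the real file, the line count shows roughly how far the export got before it stopped.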

thanks for your reply @dlakelan

thanks for your reply @kobusherbst

thanks for your reply @xiaodai