Export very large dataframe

bojusemo · October 5, 2021, 10:53am

Hi guys,

Could someone please tell me how to export a 6 GB dataframe with 200 million rows and 4 columns? I am using the CSV.write function from the CSV package, but it only exports part of the table. I think it is because the dataframe is too big. Any ideas on how to deal with it? I guess that what I need to do is open a file and write to it piece by piece, but I do not know how to do that with a dataframe. I would really appreciate any help.

Thank you

Boris

rafael.guerra · October 5, 2021, 10:57am

Have you tried the compress keyword argument of CSV.write?

bojusemo · October 5, 2021, 12:23pm

Hi @rafael.guerra,
Thank you for your reply. I do not see a compress keyword argument in the documentation, and passing compress = true produces an error. Do you have more information?

kobusherbst · October 5, 2021, 12:28pm

I use Arrow.jl to store very large DataFrames (500 million+ rows). The Arrow record batches feature is also pretty useful for processing very large data sets.
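A minimal sketch of that approach, assuming Arrow.jl, DataFrames.jl and Tables.jl are installed. The small random table, the temporary file path and the batch size of 250 rows are placeholders for illustration only, not values from this thread. Each row range is written as one record batch, and Arrow.Stream then reads the batches back one at a time, so the whole table never has to be materialized at once on the reading side.

```julia
using Arrow, DataFrames, Tables

# Small stand-in for the real 200-million-row table (hypothetical data).
df = DataFrame(rand(1_000, 4), :auto)
path = joinpath(mktempdir(), "df.arrow")

# Split the rows into ranges; each range becomes one Arrow record batch.
batch_rows = 250
ranges = [i:min(i + batch_rows - 1, nrow(df)) for i in 1:batch_rows:nrow(df)]
Arrow.write(path, Tables.partitioner(r -> df[r, :], ranges))

# Process the file one record batch at a time and count the rows seen.
rows_read = sum(nrow(DataFrame(batch)) for batch in Arrow.Stream(path))
println("rows written: ", nrow(df), ", rows read back: ", rows_read)

# Or load everything in one go if it fits in memory.
df_copy = DataFrame(Arrow.Table(path))
```

Comparing the number of rows read back with nrow(df) is a quick way to confirm that the export is complete.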

xiaodai · October 5, 2021, 12:51pm

If you are not concerned about stability across Julia versions, you can try

using JDF, DataFrames
JDF.save("path/to/file.jdf", df)

# to load it back; wrapping in DataFrame turns the loaded table into a DataFrame
df_copy = DataFrame(JDF.load("path/to/file.jdf"))

dlakelan · October 5, 2021, 1:05pm

This sounds like a bug. I’m sure CSV.write is written to export the whole thing, but maybe it is encountering something that makes it abort.

rafael.guerra · October 5, 2021, 7:14pm

Compression is supported in CSV.jl v0.9.

The CSV.jl documentation covers this in the section on writing.

I tested the code below on Windows 10 with Julia 1.7, for a dataframe with 200 million rows and 4 Float64 columns.

In standard CSV writing mode (no compression) it took 2-3 min and 15 GB of disk space.

However, compression seems to be very slow for random input data of this size, and 7-Zip is similarly slow. The compressed write ran for more than 20 min before I aborted it, and it might achieve only ~50% compression (to be confirmed).

NB: for a smaller dataframe of random data, CSV.jl gzip compression achieved slightly better than 50%.

using CSV, DataFrames

nr = 200_000_000;  nc = 4
df = DataFrame(rand(nr, nc), :auto)   # about 6.4 GB of Float64 in memory
CSV.write("df_200M_x_4.csv", df)  # no compression: 2-3 min, 15 GB file
CSV.write("df_200M_x_4.gzip", df, compress=true)  # > 20 min (aborted), ~50% compression? To be confirmed
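The compression ratio can be estimated on a small table before committing to the full export. The sketch below assumes CSV.jl v0.9 or later (for the compress keyword and for reading gzip files directly); the table size and the file names are made up for illustration.

```julia
using CSV, DataFrames

# Small random table standing in for the 200M-row one (hypothetical size).
df_small = DataFrame(rand(100_000, 4), :auto)

dir = mktempdir()
plain_path = joinpath(dir, "small.csv")
gz_path = joinpath(dir, "small.csv.gz")

CSV.write(plain_path, df_small)
CSV.write(gz_path, df_small; compress = true)

# Ratio of compressed to uncompressed file size.
ratio = filesize(gz_path) / filesize(plain_path)
println("compressed / uncompressed size: ", round(ratio; digits = 2))

# Read the gzip file back to check that no rows were lost.
df_back = CSV.read(gz_path, DataFrame)
println("rows written: ", nrow(df_small), ", rows read back: ", nrow(df_back))
```

Random Float64 values are close to a worst case for gzip, so real data with repeated values may compress considerably better.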

bojusemo · October 10, 2021, 8:32am

Thank you @rafael.guerra,
I updated Julia to 1.7 and updated the CSV package. After that, a much larger file was exported, but it still did not contain all the rows. Maybe there is a resource limitation, as I am working on a laptop?
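One way to narrow this down is to write the table in chunks and then count the lines in the output file. This is only a sketch: write_in_chunks is a hypothetical helper, not part of CSV.jl, and the chunk size is an arbitrary assumption. It relies on the append keyword of CSV.write, which by default skips the header when appending.

```julia
using CSV, DataFrames

# Hypothetical helper: write `df` to `path` a block of rows at a time.
function write_in_chunks(path, df; chunk_rows = 10_000_000)
    n = nrow(df)
    n == 0 && return CSV.write(path, df)   # header only
    for start in 1:chunk_rows:n
        stop = min(start + chunk_rows - 1, n)
        # The first chunk creates the file with a header; later chunks append.
        CSV.write(path, view(df, start:stop, :); append = start > 1)
    end
    return path
end

# Small stand-in table so the sketch runs quickly (hypothetical data).
df = DataFrame(rand(1_000, 4), :auto)
path = joinpath(mktempdir(), "chunked.csv")
write_in_chunks(path, df; chunk_rows = 300)

# One header line plus one line per row is expected for purely numeric data.
lines_written = countlines(path)
println("expected lines: ", nrow(df) + 1, ", found: ", lines_written)
```

If the line count falls short for the full table, the chunk at which writing stops gives a hint as to whether disk space, memory, or a particular row is the problem.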

bojusemo · October 10, 2021, 8:33am

thanks for your reply @dlakelan

bojusemo · October 10, 2021, 8:33am

thanks for your reply @kobusherbst

bojusemo · October 10, 2021, 8:34am

thanks for your reply @xiaodai
