Export very large dataframe

Compression is supported since CSV.jl v0.9.

The CSV.jl documentation covers this in its writing section.
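For reference, the documented pattern is just the `compress` keyword on `CSV.write`. A minimal sketch (the filename is illustrative; `compress=true` gzips the output regardless of the extension, as far as I can tell):

using CSV, DataFrames

# Minimal sketch of gzip-compressed writing (CSV.jl >= 0.9)
df = DataFrame(a = rand(10), b = rand(10))
CSV.write("example.csv.gz", df, compress=true)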

The code below was tested on Windows 10 with Julia 1.7, on a dataframe with 200 million rows and 4 Float64 columns.

It took 2-3 min and 15 GB of disk space in standard CSV writing mode (no compression). The size is about what you'd expect: 800 million values at roughly 19 bytes per printed Float64 comes to ~15 GB.

However, compression is very slow for random input data of this size; I saw the same slowness with 7-Zip, for instance. The compressed write seems to take more than 20 minutes (I aborted it out of impatience), and it would likely achieve only about 50% compression, to be confirmed. That is expected for random digits: they are high-entropy, so gzip has little redundancy to exploit.

NB: for smaller random-data dataframes, CSV.jl's gzip achieved slightly better than 50% compression.
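A quick way to check that ratio yourself on a smaller frame (a sketch; the sizes and filenames are arbitrary):

using CSV, DataFrames

# Write the same data with and without compression, then compare file sizes
df_small = DataFrame(rand(1_000_000, 4), :auto)
CSV.write("small.csv", df_small)
CSV.write("small.csv.gz", df_small, compress=true)
ratio = filesize("small.csv.gz") / filesize("small.csv")
println("compressed size = ", round(100 * ratio; digits=1), "% of original")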

using CSV, DataFrames

nr = 200_000_000; nc = 4
df = DataFrame(rand(nr, nc), :auto)
CSV.write("df_200M_x_4.csv", df)                  # no compression: 2-3 min, 15 GB file
CSV.write("df_200M_x_4.gzip", df, compress=true)  # > 20 min, ~50% compression? To be confirmed