Fastest way to save a large number of DataFrames to disk

I have a large number of dataframes generated from other sources, like:

using DataFrames
N = 1000
# Each dataframe has a random number of rows (1000–2000) and 146 columns
df_dict = Dict(1:N .=> [DataFrame(rand(rand(1000:2000), 146), :auto) for _ in 1:N])

I’d like to save them into a single file. For this purpose, I chose Arrow.jl. I first combine all those dataframes into a single dataframe

df = reduce(vcat, values(df_dict))

Then I save it as an Arrow file

using Arrow
Arrow.write("test.arrow", df)

But the combining step is extremely slow (takes minutes to hours to finish). Is there any better way to save this large number of dataframes to disk? N can be as large as 5000, and the number of rows can be as large as 3000.

I also tried saving each dataframe as a separate CSV file, but that did not improve the situation significantly. Even worse, reading these files back takes much longer than reading from an Arrow file.
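
Roughly, the per-file approach looked like this (a minimal sketch; the file names are just illustrative):

using CSV

# One CSV file per dataframe; both writing and reading back turn out to be slow
for (k, d) in df_dict
    CSV.write("df_$k.csv", d)
end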


Inspired by this post (Writing Arrow files by column), I found quite a fast way to save a large number of dataframes into a single Arrow file:

open(Arrow.Writer, "test1.arrow") do writer
    for df in values(df_dict)
        # each call appends the dataframe as a new record batch in the same file
        Arrow.write(writer, df)
    end
end

It completely skips the combining step. If I want a single dataframe, I simply load the Arrow file back, which is blazingly fast.
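
For example (a minimal sketch, assuming the file written above):

# Arrow.Table reads all record batches; DataFrame materializes them as one table
df = DataFrame(Arrow.Table("test1.arrow"))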

Edit:

An even better approach is to use Tables.partitioner with Arrow.write, which “will use multiple threads to write multiple record batches simultaneously (e.g. if julia is started with julia -t 8 or the JULIA_NUM_THREADS environment variable is set)” (from the Arrow.jl docs).

using Tables

parts = Tables.partitioner(values(df_dict))
Arrow.write("test.arrow", parts)

This is not intended as a proposal for the problem under consideration, just an alternative for the case of a few dataframes (about 3 minutes to save the CSV of 100 dataframes) and for those who do not have, or do not want to depend on, Arrow.

using CSV

bigdf = DataFrame((ddf = [last(e) for e in df_dict],))

CSV.write("CSV_bigdf.csv", bigdf)

I don’t know if this is a relevant situation, but I would like to point out that Arrow does not allow saving a DataFrame that has nested DataFrames.

julia> Arrow.write("arrow_bigdf.arrow", bigdf)  # error
ERROR: ArgumentError: `valtype(d)` must be concrete to serialize map-like `d`, but `valtype(d) == Tuple{Any, Any}`
Stacktrace:
  [1] arrowvector(::ArrowTypes.MapKind, x::

Using Tables.partitioner with CSV.write for N = 100 takes about 10 seconds to save the file:

N = 100

df_dict = Dict(1:N .=> [DataFrame(rand(rand(1000:2000), 146), :auto) for _ in 1:N])

parts = Tables.partitioner(values(df_dict))

julia> @time CSV.write("CSV_bigdf_parts.csv", parts)
  9.675832 seconds (65.64 M allocations: 9.869 GiB, 16.34% gc time)
"CSV_bigdf_parts.csv"