Fastest way to save a large number of DataFrames to disk

I have a large number of dataframes generated from other sources, for example:

using DataFrames
N = 1000
df_dict = Dict(1:N .=> [DataFrame(rand(rand(1000:2000), 146), :auto) for _ in 1:N])

I’d like to save them into a single file. For this purpose, I chose Arrow.jl. I first combine all those dataframes into a single dataframe

df = reduce(vcat, values(df_dict))

Then I save it as an Arrow file

using Arrow
Arrow.write("test.arrow", df)

But the combining step is extremely slow (it takes minutes to hours to finish). Is there a better way to save this large number of dataframes to disk? N can be as large as 5000, and the number of rows can be as large as 3000.

I also tried to save each dataframe as a separate CSV file, but the situation does not improve significantly. Even worse, reading back these files takes much longer than reading from an Arrow file.


Inspired by this post (Writing Arrow files by column), I found a quite fast way to save a large number of dataframes into a single Arrow file:

open(Arrow.Writer, "test1.arrow") do writer
    for df in values(df_dict)
        Arrow.write(writer, df)
    end
end
It completely skips the combining step. If I want a single dataframe, I simply load the Arrow file back, which is blazingly fast.
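For completeness, a minimal sketch of the read-back step (assuming the `test1.arrow` file written above): `Arrow.Table` memory-maps the file, and wrapping it in `DataFrame` materializes all record batches as one table.

```julia
using Arrow, DataFrames

# Arrow.Table lazily maps the file; DataFrame concatenates all record batches
tbl = Arrow.Table("test1.arrow")
df = DataFrame(tbl)
```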


An even better approach is to use Tables.partitioner with Arrow.write, which “will use multiple threads to write multiple record batches simultaneously (e.g. if julia is started with julia -t 8 or the JULIA_NUM_THREADS environment variable is set)” (from the Arrow.jl docs).

parts = Tables.partitioner(values(df_dict))
Arrow.write("test.arrow", parts)
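If you later want the individual dataframes back rather than one combined table, each partition is written as its own record batch, so you can iterate them with `Arrow.Stream` (a sketch, assuming the `test.arrow` file written above):

```julia
using Arrow, DataFrames

# Arrow.Stream yields one table per record batch, i.e. one per original partition
dfs = [DataFrame(batch) for batch in Arrow.Stream("test.arrow")]
```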

This is not intended as a solution to the problem under consideration, just an alternative for the case of a few dataframes (about 3 minutes to save the CSV of 100 dataframes) and for those who do not have, or do not want to depend on, Arrow.

bigdf=DataFrame((ddf=[last(e) for e in df_dict],))

CSV.write("CSV_bigdf.csv", bigdf)

I don’t know if this is a relevant situation, but I would like to point out that Arrow does not allow saving a DataFrame that has nested DataFrames.

julia> Arrow.write("arrow_bigdf.arrow", bigdf)  # error
ERROR: ArgumentError: `valtype(d)` must be concrete to serialize map-like `d`, but `valtype(d) == Tuple{Any, Any}`
  [1] arrowvector(::ArrowTypes.MapKind, x::

Using Tables.partitioner with CSV for N = 100 takes about 10 seconds to save the file:

N = 100
df_dict = Dict(1:N .=> [DataFrame(rand(rand(1000:2000), 146), :auto) for _ in 1:N])
parts = Tables.partitioner(values(df_dict))
julia> @time CSV.write("CSV_bigdf_parts.csv", parts)
  9.675832 seconds (65.64 M allocations: 9.869 GiB, 16.34% gc time)