Inspired by this post (Writing Arrow files by column), I found quite a fast way to save a large number of dataframes into a single Arrow file:
```julia
open(Arrow.Writer, "test1.arrow") do writer
    for df in values(df_dict_rand)
        Arrow.write(writer, df)
    end
end
```
This completely skips the combining step. If I want a single dataframe, I can simply load the Arrow file back, which is blazingly fast.
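For completeness, loading the file back as one dataframe might look like the sketch below (assuming Arrow.jl and DataFrames.jl are installed, and `test1.arrow` is the file written above):

```julia
using Arrow, DataFrames

# Each record batch written above becomes one partition of the table;
# the DataFrame constructor materializes them all into a single dataframe.
df_all = DataFrame(Arrow.Table("test1.arrow"))
```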
Edit:
An even better approach is to use `Tables.partitioner` together with `Arrow.write`, which "will use multiple threads to write multiple record batches simultaneously (e.g. if julia is started with `julia -t 8` or the `JULIA_NUM_THREADS` environment variable is set)" (from the Arrow.jl docs):
```julia
parts = Tables.partitioner(values(df_dict_rand))
Arrow.write("test.arrow", parts)
```