Fastest way to save a large number of DataFrames to disk

Inspired by this post (Writing Arrow files by column), I found quite a fast way to save a large number of DataFrames into a single Arrow file, as follows.
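
In what follows, `df_dict_rand` is assumed to be a dictionary of DataFrames that all share the same schema (each table becomes a record batch of the same Arrow file, so the columns must match). A minimal sketch of such a setup, with hypothetical names and sizes:

using DataFrames

# hypothetical setup: 1_000 small DataFrames with identical columns
df_dict_rand = Dict("df_$i" => DataFrame(a = rand(100), b = rand(1:10, 100)) for i in 1:1_000)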

using Arrow

open(Arrow.Writer, "test1.arrow") do writer
    for df in values(df_dict_rand)
        # each table is appended as its own record batch
        Arrow.write(writer, df)
    end
end

It completely skips the combining step. If I want a single DataFrame, I can simply load the Arrow file back, which is blazingly fast.
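
For reference, a minimal sketch of loading the file back into one DataFrame; Arrow.Table concatenates the record batches on read:

using Arrow, DataFrames

# the record batches are stitched back together into single columns
df_all = DataFrame(Arrow.Table("test1.arrow"))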

Edit:

An even better approach is to use Tables.partitioner together with Arrow.write, which "will use multiple threads to write multiple record batches simultaneously (e.g. if julia is started with julia -t 8 or the JULIA_NUM_THREADS environment variable is set)" (from the Arrow.jl docs).

using Arrow, Tables

# each element of the partitioner becomes a record batch,
# written concurrently when Julia has multiple threads
parts = Tables.partitioner(values(df_dict_rand))
Arrow.write("test.arrow", parts)