Inspired by this post (Writing Arrow files by column), I found quite a fast way to save a large number of dataframes into a single Arrow file:
```julia
open(Arrow.Writer, "test1.arrow") do writer
    for df in values(df_dict_rand)
        Arrow.write(writer, df)
    end
end
```
This completely skips the combining step. If I want a single dataframe, I can simply load the Arrow file back, which is blazingly fast.
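For completeness, loading the file back as one dataframe might look like the sketch below (assuming Arrow.jl and DataFrames.jl are installed, and `test1.arrow` is the file written above):

```julia
using Arrow, DataFrames

# Each record batch written above becomes one partition of the table;
# the DataFrame constructor materializes them all into a single dataframe.
df_all = DataFrame(Arrow.Table("test1.arrow"))
```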
Edit:
An even better approach is to use `Tables.partitioner` together with `Arrow.write`, which "will use multiple threads to write multiple record batches simultaneously (e.g. if julia is started with `julia -t 8` or the `JULIA_NUM_THREADS` environment variable is set)" (from the Arrow.jl docs):
```julia
parts = Tables.partitioner(values(df_dict_rand))
Arrow.write("test.arrow", parts)
```