Add rows to DataFrame (or similar table-like struct) in parallel?

e3c6 · November 19, 2018, 5:31pm

I want to produce many rows of a DataFrame in parallel (using a @distributed for loop).

I like DataFrames because then I can Query it which is very convenient. But any other similar table-like structure that can be queried is good for me.

Any recommendations on how to do this? Specifically, I need help on how to share the same table object across the processes that are writing to it.

pdeffebach · November 19, 2018, 5:51pm

I don't think that is currently feasible, but it's on the radar for how this might work.

If you are adventurous, you could try forking DataFrames and changing the constructors to make all the arrays SharedArrays and let us know what problems you faced.
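Short of forking DataFrames, here is a rough sketch of a related workaround, not the fork itself: if the number of rows is known up front, each column can be preallocated as a SharedArray, filled in parallel, and wrapped in a DataFrame afterwards. The column names `id` and `sq`, the row count, and the two local workers are made up for illustration, and this assumes all workers run on the same machine.

```julia
using Distributed
addprocs(2)                      # assumption: two local worker processes
@everywhere using SharedArrays   # workers need SharedArrays to write to the columns
using DataFrames

n = 10                           # number of rows must be known in advance
ids = SharedArray{Int}(n)        # one shared column per field (hypothetical names)
sq  = SharedArray{Float64}(n)

# Each worker writes only to its own indices, so no locking is needed.
@sync @distributed for i in 1:n
    ids[i] = i
    sq[i]  = i^2
end

# Copy the shared columns into ordinary Vectors before building the table.
df = DataFrame(id = collect(ids), sq = collect(sq))
```

This only covers fixed-size numeric columns; it does not let workers append rows to a growing table.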

carstenbauer · November 19, 2018, 6:27pm

I use vcat as the reducer for @distributed. This way each worker vcats its local part of the overall DataFrame, and finally the master (the process that ran the @distributed loop) vcats the worker DataFrames. The speed-up is sufficient for my use case. A SharedDataFrame would be great though.
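A minimal sketch of the vcat-reducer pattern, assuming two local workers and a toy loop body; the column names `i` and `sq` are placeholders for whatever each iteration actually computes:

```julia
using Distributed
addprocs(2)                    # assumption: two local worker processes
@everywhere using DataFrames   # every worker builds DataFrames, so load it everywhere

# Each iteration returns a one-row DataFrame. Workers vcat their own rows
# locally; the calling process then vcats the per-worker results.
df = @distributed (vcat) for i in 1:10
    DataFrame(i = i, sq = i^2)
end
```

The loop body can return several rows at once if that is more natural; vcat just needs every piece to have the same columns. No table object is shared between processes here, so the final concatenation on the master is the serial part.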

e3c6 · November 19, 2018, 6:42pm

I think this is enough for my purposes. Thanks!
