Add rows to DataFrame (or similar table-like struct) in parallel?

I want to produce many rows of a DataFrame in parallel (using a @distributed for loop).

I like DataFrames because I can then query them with Query, which is very convenient. But any other table-like structure that can be queried would work for me.

Any recommendations on how to do this? Specifically, I need help with sharing the same table object across the processes that are writing to it.

I don't think that is currently feasible, but how this might work is on the radar.

If you are adventurous, you could try forking DataFrames and changing the constructors to make all the arrays SharedArrays and let us know what problems you faced.
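
Short of forking DataFrames, a lighter variant of the same idea might be to preallocate one SharedArray per column, let the workers fill disjoint rows, and only wrap the result in a DataFrame at the end. This is only a sketch under some assumptions: all workers run on the same machine, the column types are plain bits types (`Int`, `Float64`), the number of rows is known up front, and the column names `id` and `square` are made up for illustration.

```julia
using Distributed
addprocs(2)                      # assumption: two local workers on this machine
@everywhere using SharedArrays
using DataFrames

n = 100                          # number of rows, assumed known in advance

# One shared column per field; only bits types can live in a SharedArray.
ids     = SharedVector{Int}(n)
squares = SharedVector{Float64}(n)

# Each iteration writes to its own row, so workers never touch the same slot.
@sync @distributed for i in 1:n
    ids[i]     = i
    squares[i] = i^2
end

# Copy the shared columns into ordinary vectors and build the table on the master.
df = DataFrame(id = Array(ids), square = Array(squares))
```

The table itself is never shared here, only the raw columns, so this does not help if rows need to be appended dynamically or if columns hold strings or other non-bits types.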

I use vcat as the reducer for @distributed. This way each worker locally vcats its own part of the overall DataFrame. Finally, the master (the process that started the @distributed loop) vcats the workers' DataFrames. The speed-up is sufficient for my use case. A SharedDataFrame would be great though.
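
A minimal sketch of that pattern, in case it helps. The function `make_row` and the column names are hypothetical stand-ins for whatever each iteration actually computes, and it assumes DataFrames is installed in the environment the workers use.

```julia
using Distributed
addprocs(2)                      # assumption: two local workers
@everywhere using DataFrames

# Hypothetical per-iteration work: each iteration produces a one-row DataFrame.
@everywhere make_row(i) = DataFrame(id = i, square = i^2)

# vcat is the reducer: every worker concatenates the rows of its own chunk,
# then the calling process concatenates the per-worker DataFrames.
df = @distributed (vcat) for i in 1:100
    make_row(i)
end

# Sort explicitly rather than relying on the order in which chunks are combined.
sort!(df, :id)
```

Since no table is shared between processes, the cost is the serialization of each worker's partial DataFrame back to the master, which tends to be small if the per-row work dominates.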

I think this is enough for my purposes. Thanks!