Efficient way to update a DataFrame stored in a bson file

Hi,

I have two DataFrames: df and new_df. These DataFrames have the same structure (i.e., the same columns). df is stored in a bson file and new_df is computed by some code.

What is the most efficient way to update df by appending new_df? I expect df to grow quickly, so choosing an efficient approach now should pay off over time.

Appending can only be efficient if the data is stored row by row, so that writing several new rows does not rewrite the whole file (I don't know whether the DataFrames-to-BSON serializer works this way).
The most efficient layout, though, is column-wise, with each column in its own binary stream or file, so you can read or write a subset of columns and append to each column individually. See how the HDF5 format does it: it uses chunks for growing datasets.
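
To make the row-by-row idea concrete, here is a minimal sketch in Python using only the standard library. It is not tied to BSON or DataFrames. The two-column schema (an int64 id and a float64 value) and the names `append_rows`, `read_rows` and `demo` are assumptions made up for illustration. The point is that opening the file in append mode adds bytes at the end and leaves the existing rows untouched.

```python
import os
import struct
import tempfile

# Hypothetical two-column schema: an int64 id and a float64 value.
# "<" means little-endian with no padding, so each row is exactly 16 bytes.
RECORD = struct.Struct("<qd")

def append_rows(path, rows):
    """Append rows to the end of the file without rewriting existing bytes."""
    with open(path, "ab") as f:
        for row in rows:
            f.write(RECORD.pack(*row))

def read_rows(path):
    """Read every fixed-width record back as a list of tuples."""
    with open(path, "rb") as f:
        data = f.read()
    return [RECORD.unpack_from(data, offset)
            for offset in range(0, len(data), RECORD.size)]

def demo():
    """Write two rows, append a third, and report file sizes and contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "table.bin")
        append_rows(path, [(1, 0.5), (2, 1.5)])
        size_before = os.path.getsize(path)
        append_rows(path, [(3, 2.5)])
        size_after = os.path.getsize(path)
        return size_before, size_after, read_rows(path)
```

Fixed-width records only work for fixed-size column types; strings or missing values would need a length prefix or a separate encoding.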


This is what I am trying to do. I am not sure I understand how I should proceed though.

I can either:

  • a) save each new row as a separate bson file, or
  • b) save everything in the same output file.

Ideally, I would prefer to have everything in the same file. However, I think that loading and overwriting the same bson could become heavy over time.

I am flexible towards changing the output type from bson to HDF5 if that helps. However, I am not sure whether you are simply suggesting to proceed with option a.

I was talking about the column-wise approach: saving each column in a separate binary file. If you need row-wise writes, you can store the whole table in one binary file, row by row, and append new rows to its end.
The column-wise option is better and faster for analytics, since you can select individual columns without loading the whole table, while row-wise is better for acquiring data row by row (see also the AoS vs. SoA trade-off). Writing binary files is trivial and fast, since no conversion to text is needed. You may also want to save a metadata file with the field names and data types alongside the binaries. BSON files can be slower and bigger, especially if the field names are duplicated for each row.
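
A rough sketch of the column-wise layout with a metadata file, again in standard-library Python. The schema, the file names (`meta.json`, `<column>.bin`) and the functions `append_columns`, `read_column` and `demo_columns` are hypothetical, chosen only to illustrate the idea. Values are written in the machine's native byte order, so the files are not portable across architectures as written.

```python
import json
import os
import tempfile
from array import array

# Hypothetical schema: column name -> array typecode ("q" = int64, "d" = float64).
SCHEMA = {"id": "q", "value": "d"}

def append_columns(directory, columns):
    """Append new values to one binary file per column; write metadata once."""
    os.makedirs(directory, exist_ok=True)
    meta_path = os.path.join(directory, "meta.json")
    if not os.path.exists(meta_path):
        with open(meta_path, "w") as f:
            json.dump(SCHEMA, f)
    for name, typecode in SCHEMA.items():
        # Append mode adds to the end of the column file only.
        with open(os.path.join(directory, name + ".bin"), "ab") as f:
            array(typecode, columns[name]).tofile(f)

def read_column(directory, name):
    """Load a single column without reading any of the other column files."""
    with open(os.path.join(directory, "meta.json")) as f:
        typecode = json.load(f)[name]
    column = array(typecode)
    with open(os.path.join(directory, name + ".bin"), "rb") as f:
        column.frombytes(f.read())
    return column.tolist()

def demo_columns():
    """Append two batches, then read the columns back independently."""
    with tempfile.TemporaryDirectory() as tmp:
        append_columns(tmp, {"id": [1, 2], "value": [0.5, 1.5]})
        append_columns(tmp, {"id": [3], "value": [2.5]})
        return read_column(tmp, "id"), read_column(tmp, "value")
```

Reading `value` here never opens `id.bin`, which is the benefit for analytics. The cost is that appending one logical row touches every column file.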

HDF5 is also very fast and space-efficient thanks to optional compression, and it supports chunking for incremental writes, so you could start with it (MATLAB's v7.3 MAT-files and Julia's JLD format are both built on HDF5).
