Efficient way to update a DataFrame stored in a bson file

fipelle · December 19, 2019, 3:11pm

Hi,

I have two DataFrames: df and new_df. These DataFrames have the same structure (i.e., the same columns). df is stored in a bson file and new_df is computed by some code.

I was wondering what the most efficient way to update df by appending new_df is. I expect df to grow quickly, so planning for efficiency now could help over time.

sairus7 · December 19, 2019, 4:12pm

It can only be efficient if the data is stored row by row, so that appending several new rows does not overwrite the whole file (I don’t know whether the DataFrames-to-BSON serializer implements this).
But the most efficient approach would be to store it column-wise, with columns in separate binary streams or files, so that you can read or write a subset of columns and append to individual columns. See how the HDF5 format handles this: it uses chunks for growing datasets.

fipelle · December 19, 2019, 7:36pm

This is what I am trying to do. I am not sure I understand how I should proceed though.

I can either:

a) save each new row as a separate bson file, or
b) save everything in the same output file.

Ideally, I would prefer to have everything in the same file. However, I think that loading and overwriting the same bson could become heavy over time.

I am open to changing the output type from bson to HDF5 if that helps. However, I am not sure whether you are simply suggesting that I proceed with option a.

sairus7 · December 19, 2019, 8:30pm

I was talking about column-wise storage, i.e. saving each column in a separate binary file. If you need row-wise writes, you can just store the whole table in one binary file, row by row, and append new rows to its end.
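
To make the row-wise option concrete, here is a minimal sketch using only Julia's built-in I/O. The two-field row layout (an `Int64` id and a `Float64` value), the file name, and the helpers `append_rows` and `read_rows` are all hypothetical, chosen for illustration; a real table would need its own fixed layout, and this raw format is not portable across machines with different endianness.

```julia
# Hypothetical fixed-width row layout: (id::Int64, value::Float64) = 16 bytes.
const ROW_BYTES = sizeof(Int64) + sizeof(Float64)

# Append rows to the end of the file; existing bytes are never rewritten.
function append_rows(path, ids::Vector{Int64}, vals::Vector{Float64})
    open(path, "a") do io
        for (id, v) in zip(ids, vals)
            write(io, id)   # 8 raw bytes
            write(io, v)    # 8 raw bytes
        end
    end
end

# Read every row back; the row count follows from the file size.
function read_rows(path)
    n = filesize(path) ÷ ROW_BYTES
    ids = Vector{Int64}(undef, n)
    vals = Vector{Float64}(undef, n)
    open(path, "r") do io
        for i in 1:n
            ids[i] = read(io, Int64)
            vals[i] = read(io, Float64)
        end
    end
    return ids, vals
end

path = joinpath(mktempdir(), "rows.bin")
append_rows(path, [1, 2], [0.5, 1.5])   # rows of the initial df
append_rows(path, [3], [2.5])           # rows of new_df, added at the end
ids, vals = read_rows(path)
```

The cost of each append depends only on the number of new rows, not on how large the file already is.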
The column-wise option is better and faster for analytics, since you can easily select individual columns without loading the whole table, while row-wise storage is better for acquiring data row by row (see also the AoS vs SoA problem). Writing binary files is trivial and fast, since no conversion to text is needed. You may also want to save a metadata file with the field names and data types alongside the binaries. BSON files can be slower and bigger, especially if the field names are duplicated for each row.
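
A rough sketch of the column-wise layout with a metadata file might look like the following. The directory layout, the `meta.csv` name, and the helpers `append_columns` and `read_column` are assumptions made up for this example; it only handles columns of plain bits types such as `Int64` or `Float64`, and it assumes every append uses the same columns.

```julia
# Hypothetical on-disk layout: one raw binary file per column, plus a small
# text file listing "name,type" for each column.
function append_columns(dir, cols::NamedTuple)
    mkpath(dir)
    # Metadata is rewritten on each call; the schema is assumed not to change.
    open(joinpath(dir, "meta.csv"), "w") do io
        for (name, col) in pairs(cols)
            println(io, name, ",", eltype(col))
        end
    end
    for (name, col) in pairs(cols)
        open(joinpath(dir, string(name, ".bin")), "a") do io
            write(io, col)   # raw element bytes appended at the end
        end
    end
end

# Load one column without touching the files of the others.
function read_column(dir, name, ::Type{T}) where {T}
    path = joinpath(dir, string(name, ".bin"))
    col = Vector{T}(undef, filesize(path) ÷ sizeof(T))
    read!(path, col)
    return col
end

dir = mktempdir()
append_columns(dir, (id = [1, 2], value = [0.5, 1.5]))   # initial df
append_columns(dir, (id = [3], value = [2.5]))           # new_df
value_col = read_column(dir, :value, Float64)            # reads only value.bin
```

Here the element type is passed in by hand; in practice it would be looked up in the metadata file.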

HDF5 is also very fast and space-efficient thanks to optional compression, and it supports chunking for incremental writes, so you can start with it (in fact, Julia's JLD format and MATLAB's v7.3 MAT-files are built on HDF5).
