How to store data where columns are added/updated separately?

Satvik · March 17, 2022, 9:38pm

I have data with columns that can get added or updated at different times. For example, my data might all start at 2021-01-01, have some baseline columns like temperature and humidity that get updated every minute, and some expensive derived columns like probability_of_rain and predicted_temperature_tomorrow.

Say I add these two columns on 2021-02-01. Then I might not update them for a while, and decide to update probability_of_rain on 2021-03-01 while leaving the other one alone.

What’s a good way to store this data? One obvious method is to just have a file for each column. I tried looking into Parquet, but it seemed like it’s difficult to add new columns to a partitioned Parquet dataset.

Some other considerations:

- In reality, there are thousands of derived columns. On average, it takes about 5 minutes to recalculate a derived column from scratch, which is what we're currently doing every time.
- Every column is numeric (Float32).
- My most common query is pulling all the data for a column without any filtering, e.g. select probability_of_rain from table.
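
To make the "one file per column" idea concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not a recommendation of a specific format: the helper names `write_column` and `read_column`, the `.f32` extension, and the `weather` directory are all made up for this example, and it assumes the platform's C `float` is 4 bytes (which matches Float32 on common systems).

```python
import tempfile
from array import array
from pathlib import Path


def write_column(root: Path, name: str, values) -> None:
    """Write (or overwrite) a single column as its own raw float32 file."""
    root.mkdir(parents=True, exist_ok=True)
    tmp = root / f"{name}.f32.tmp"
    with open(tmp, "wb") as f:
        # Typecode "f" is a C float, assumed here to be 4 bytes (Float32).
        array("f", values).tofile(f)
    # Swap the finished file into place so readers never see a partial write.
    tmp.replace(root / f"{name}.f32")


def read_column(root: Path, name: str) -> array:
    """Read one whole column; no other column's file is touched."""
    col = array("f")
    col.frombytes((root / f"{name}.f32").read_bytes())
    return col


# Hypothetical store in a temporary directory.
store = Path(tempfile.mkdtemp()) / "weather"

# Baseline column, then a derived column added later.
write_column(store, "temperature", [10.5, 11.0, 11.5])
write_column(store, "probability_of_rain", [0.1, 0.2, 0.3])

# Later still: recompute only probability_of_rain; temperature is untouched.
write_column(store, "probability_of_rain", [0.4, 0.5, 0.6])

rain = read_column(store, "probability_of_rain")
temperature = read_column(store, "temperature")
```

With this layout, adding or updating a column is one file write, and the "select one column, no filtering" query is one file read. The trade-off is that row alignment across columns (for example, the shared timestamps) has to be tracked separately.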
