How to store data where columns are added/updated separately?

I have data with columns that can get added or updated at different times. For example, my data might all start at 2021-01-01, have some baseline columns like temperature and humidity that get updated every minute, and some expensive derived columns like probability_of_rain and predicted_temperature_tomorrow.

Say I add these two columns on 2021-02-01. Then I might not update them for a while, and decide to update probability_of_rain on 2021-03-01 while leaving the other one alone.

What’s a good way to store this data? One obvious method is to just have a file for each column. I tried looking into Parquet, but it seemed like it’s difficult to add new columns to a partitioned Parquet dataset.

Some other considerations:

  • In reality, there are thousands of derived columns. On average, it takes about 5 minutes to recalculate a derived column from scratch, which is what we’re currently doing every time
  • Every column is Numeric (Float32)
  • My most common query is pulling all the data for a column without any filtering, e.g. select probability_of_rain from table