I have data with columns that can get added or updated at different times. For example, my data might all start at 2021-01-01, have some baseline columns like temperature
and humidity
that get updated every minute, and some expensive derived columns like probability_of_rain
and predicted_temperature_tomorrow
.
Say I add these two columns on 2021-02-01. Then I might not update them for a while, and decide to update probability_of_rain
on 2021-03-01 while leaving the other one alone.
What’s a good way to store this data? One obvious method is to just have a file for each column. I tried looking into Parquet, but it seemed like it’s difficult to add new columns to a partitioned Parquet dataset.
Some other considerations:
- In reality, there are thousands of derived columns. On average, it takes about 5 minutes to recalculate a derived column from scratch, which is what we’re currently doing every time
- Every column is Numeric (Float32)
- My most common query is pulling all the data for a column without any filtering, e.g.
select probability_of_rain from table