Hello,
I have a large number of time series produced by my simulations.
As of today the dataset consists of ~1e7 time series (100 points each), stored on disk in .csv files (each one containing 5 of them). I’ll have more (10x more) in the future, and I need to find a good way of exploring the dataset.
I am thinking of a SQL database (like TimescaleDB), but I wonder if that would be overkill for my use case.
Consider using a binary format designed for scientific data, e.g. HDF5.jl.
1e8 datasets of 100 points each, in double precision, is only about 80 GB (and you might also consider single precision to save 50%), which can easily be stored in a single file. In HDF5 you can store this as a single Nx100 dataset that is “extendable” (N is an “unlimited” dimension), so that N can grow over time. You can then read slices of it quickly, or even the whole dataset if you have enough RAM. (Unfortunately, HDF5.jl currently has very limited documentation for extendable datasets, but the underlying code supports this.)
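For concreteness, here is a minimal sketch of what that could look like with HDF5.jl, storing the data as a 100×N dataset so that each series is a contiguous column in Julia. The file and dataset names are made up, and the exact `create_dataset` / `set_extent_dims` calls may differ between HDF5.jl versions, so treat it as a starting point rather than a recipe:

```julia
using HDF5

# Create a 100 x N dataset where N (the number of series) is unlimited.
h5open("timeseries.h5", "w") do file
    create_dataset(file, "ts", datatype(Float64),
                   ((100, 0), (100, -1));   # current dims, max dims (-1 = unlimited)
                   chunk=(100, 1024))       # chunking is required for extendable datasets
end

# Append a batch of new series (`batch` is a 100 x k matrix).
function append_series!(filename, batch)
    h5open(filename, "r+") do file
        d = file["ts"]
        n = size(d, 2)
        HDF5.set_extent_dims(d, (100, n + size(batch, 2)))
        d[:, n+1:n+size(batch, 2)] = batch
    end
end

# Read a slice without loading everything into RAM.
slice = h5open("timeseries.h5", "r") do file
    file["ts"][:, 1:1000]   # only these columns are read from disk
end
```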
I would use a SQL database. I’ve found them good even for relatively small use cases; systems like Postgres are extremely well-tested and solve many tricky problems like upserts. It often makes sense to default to using a relational DB until you have a specific use case that would benefit from something else.
I haven’t used Timescale specifically, but I’ve heard good things.
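As a small illustration of the upsert point, in Postgres this is a single `INSERT ... ON CONFLICT` statement; from Julia it might look roughly like this via LibPQ.jl (the table and column names are invented for illustration):

```julia
using LibPQ

conn = LibPQ.Connection("dbname=sims")   # hypothetical Postgres database

# Insert a simulation run, or update it if run_id already exists ("upsert").
execute(conn, """
    INSERT INTO runs (run_id, finished)
    VALUES (\$1, \$2)
    ON CONFLICT (run_id) DO UPDATE SET finished = EXCLUDED.finished
    """, [42, true])

close(conn)
```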
Similar experience here with HDF5 files: I have stored 100 million+ records in a single file with an average read rate of 5 million records/sec (when reading slices). This was way back in 2016 on an older Intel chip with the data stored on an SSD.
I would also have recommended SQLite if it were not for the 1e8 datasets; I don’t know how it would perform at that scale. I have experience storing 100+ time-series tables with 1000+ records each in a SQLite file and being able to read any full table (without WHERE clauses) into memory very quickly (no concrete performance numbers available).
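For what it’s worth, reading a whole table back into memory is essentially a one-liner with SQLite.jl plus DataFrames.jl; a rough sketch (file and table names invented):

```julia
using SQLite, DBInterface, DataFrames

db = SQLite.DB("simulations.db")   # hypothetical database file

# Store one length-100 series as its own table (illustrative schema).
df = DataFrame(t = 1:100, value = rand(100))
SQLite.load!(df, db, "series_0001")

# Read the full table back into memory, no WHERE clause needed.
tbl = DBInterface.execute(db, "SELECT * FROM series_0001") |> DataFrame
```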
As for databases, you may also consider QuestDB (a time-series DB, Java + Rust). It is an open-source alternative to kdb+ (the financial time-series standard) and is a pure joy to use (easy setup, convenient, and performant). Basic charting can be done in just a second from within the GUI, or, as it is Postgres-compatible, with Grafana or Pulse (TimeStored). The recent (8.0) open-source version supports ZFS, which you may extend with BeeGFS; however, that is of course not mandatory for 80 GB. Reading can be done with LibPQ.jl, and writing with QuestDBClient.jl or QuestDB.jl (the latter looks like a work in progress but seems to be faster than the former). There is also a REST API and, of course, native support for CSV (including via the GUI).
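Since QuestDB speaks the Postgres wire protocol, reading from Julia is just a LibPQ.jl connection. A sketch, assuming QuestDB’s usual defaults (port 8812, user admin / password quest) and made-up table and column names:

```julia
using LibPQ, DataFrames

# Adjust host/port/credentials to your own QuestDB instance.
conn = LibPQ.Connection("host=localhost port=8812 dbname=qdb user=admin password=quest")

# Pull one simulation's series into a DataFrame (table/column names are invented).
df = DataFrame(execute(conn, "SELECT ts, value FROM simulations WHERE run_id = 42"))

close(conn)
```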
For me, SQLite (wrapped in Julia) handles ≈1.8e9 records in a single ≈400 GB file just fine, and is definitely much faster than .csv. But I have not yet benchmarked it against HDF5, which could be a notch faster for raw reading.
What does “exploring” these 1e8 time series mean? If it is simulated data, then presumably a large number of parameters with complex relations were varied across the different time series? In that case SQL itself, and the option to have multiple indices that can be changed in the course of the exploration, could be handy. HDF5’s support for indexing is limited, or requires extensions.
My QuestDB instance is currently quite small, consisting of 10,807 files totaling 4.3 GB. It runs on a cloud VM with 3 ARM64 cores and 16 GB of RAM inside a Podman container. BeeGFS is served by 6 VMs with mixed bandwidth ranging from 0.5 Gbps to 4 Gbps. Its storage is located on an XFS filesystem, with each volume providing 11,250 IOPS (a rather significant number for the cloud, but not particularly impressive for an SSD or NVMe). Hope it helps.
2. QUERY on 3 columns of table A (19,365,625 rows x 18 columns)
[…]
V1 ASOF JOIN V2 SAMPLE BY 15m
[…]
avg(correlation) OVER (ORDER BY timestamp RANGE BETWEEN 1 HOUR PRECEDING AND CURRENT ROW) AS rolling_corr_1h,
avg(correlation) OVER (ORDER BY timestamp RANGE BETWEEN 3 HOUR PRECEDING AND CURRENT ROW) AS rolling_corr_3h,
avg(correlation) OVER (ORDER BY timestamp RANGE BETWEEN 1 DAY PRECEDING AND CURRENT ROW) AS rolling_corr_1d
[…]
EDIT: Naturally, the filesystem’s metadata is all warmed up; however, feel free to make your own judgment. In fact, to be honest, I bet that even @stevengj could be cracking a smile right now! :-)
If the data is this regular you could also just call something like
open(f -> write(f, array), filename, "a")
to append each length-100 dataset to a raw binary file, and
using Mmap
f = open(filename, "r")
n = (filesize(f) ÷ sizeof(Float64)) ÷ 100
a = mmap(f, Matrix{Float64}, (100, n))
to open the whole dataset simultaneously as a memory-mapped matrix, at which point you can do any analysis you want (e.g. slicing out a random subset of columns) easily and efficiently.
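For example, pulling a random subset of series out of the memory-mapped matrix only touches the pages backing those columns; a small illustration using `a` and `n` from the snippet above:

```julia
cols = rand(1:n, 1000)   # pick 1000 random series out of the n stored
subset = a[:, cols]      # 100 x 1000 Matrix{Float64}, now in RAM
```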
HDF5 looks interesting and the idea closely matches my current folder structure. However, I don’t have enough RAM to hold the entire dataset; would that be a problem? This is why I was thinking of a SQL database.
Regarding the post-processing, I don’t know what I’ll do with the data yet.
I’m going to make heavy use of the metadata from each simulation. I’m simulating a virtual population of patients, so I may want to look at all the 65+ year-old patients with height > 1.50 m and low blood pressure. I imagine this is doable with the HDF5 approach by carefully matching the matrix indices to the relevant metadata, but it’s possibly easier to do that in SQL.
To use HDF5, you don’t need to be able to hold all the data, or even an entire dataset, in memory; you can read smaller pieces. Personally, I like HDF5 for holding data in an organized manner; I find it fairly easy and convenient to use, and the performance is good enough.
No, not for HDF5. And not for the Mmap solution I suggested either: the whole point of a memory-mapped array is that the operating system will only “page in” the portions of the array that you actually access.
Storing the time series directly in a file and using mmap for access is probably the most efficient method there is, followed by HDF5.
But for the metadata, an SQL database should be considered. There seem to be at least three parameters: patient age, height, and blood pressure. Only one of them, for example height, could be monotonically “matched” to the column index of the series in the matrix (enabling binary search). Even this would not work if you need to run more simulations after seeing the initial results of the “exploration”. In a relational database, indices can be freely created on any combination of parameters, in any order, and these map to the column index in the matrix of time series. The indices update automatically when more data/time series are added. “Carefully matching … to relevant metadata” can be done in HDF5 or in directly stored tables, but it becomes rather complex for more than one parameter. SQL is kind of made for such a job.
So I would perhaps store the time series directly and use mmap, but design an SQL database for the metadata.
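A minimal sketch of that combination, with SQLite.jl for the metadata and the raw-binary + mmap layout from above. Every table, column, and file name here is invented, and the thresholds are just an echo of the 65+ / >1.50 m example; the key idea is simply that each metadata row stores the column index of its series in the matrix:

```julia
using SQLite, DBInterface, DataFrames, Mmap

db = SQLite.DB("metadata.db")   # hypothetical metadata database

# One row of metadata per simulated patient, pointing at a column of the matrix.
DBInterface.execute(db, """
    CREATE TABLE IF NOT EXISTS patients (
        col INTEGER PRIMARY KEY,   -- column index in the time-series matrix
        age INTEGER,
        height REAL,
        systolic_bp REAL
    )""")

# Indices can be added (or dropped) at any point during the exploration.
DBInterface.execute(db,
    "CREATE INDEX IF NOT EXISTS idx_age_height ON patients(age, height)")

# Find the columns matching the exploration query ...
sel = DataFrame(DBInterface.execute(db,
    "SELECT col FROM patients WHERE age >= 65 AND height > 1.50 AND systolic_bp < 100"))

# ... and pull just those series out of the memory-mapped matrix.
f = open("timeseries.bin", "r")
n = (filesize(f) ÷ sizeof(Float64)) ÷ 100
a = mmap(f, Matrix{Float64}, (100, n))
cols = Int.(sel.col)              # column indices of the matching patients
selected_series = a[:, cols]      # 100 x (number of matching patients)
```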