When working with data at this scale, computations do indeed usually need some form (and often multiple forms) of caching!
However, the most appropriate cache depends on your computation/access pattern, which is orthogonal to (or at least sits on top of) Onda itself as a format. At Beacon, for example, we have some workloads that read the same segments repeatedly, while others sweep through all segments and load each only once. Onda.jl makes it easy to read segments by TimeSpan but (rightfully, IMO) isn’t opinionated w.r.t. caching.
We do have a nice little LRU cache implementation that auto-spills to disk, though - should probably upstream that to LRUCaches.jl or somewhere
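To illustrate the idea (this is not the actual Julia implementation, which isn't shown here), here's a minimal Python sketch of an LRU cache that spills evicted entries to disk and transparently reloads them on access. The class name, in-memory capacity, and pickle-based serialization are all illustrative assumptions:

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class SpillingLRU:
    """Toy LRU cache: keeps up to `mem_items` entries in memory;
    entries evicted from memory are pickled to disk and reloaded
    (and re-promoted to memory) on the next access."""

    def __init__(self, mem_items=2, spill_dir=None):
        self.mem_items = mem_items
        self.mem = OrderedDict()  # insertion/access order = LRU order
        self.spill_dir = spill_dir or tempfile.mkdtemp()

    def _path(self, key):
        # hash() is stable within a single process, which is all
        # this sketch needs for naming spill files
        return os.path.join(self.spill_dir, f"{hash(key)}.pkl")

    def put(self, key, value):
        self.mem[key] = value
        self.mem.move_to_end(key)  # mark as most recently used
        while len(self.mem) > self.mem_items:
            # evict the least recently used entry to disk
            old_key, old_val = self.mem.popitem(last=False)
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_val, f)

    def get(self, key):
        if key in self.mem:
            self.mem.move_to_end(key)
            return self.mem[key]
        path = self._path(key)
        if os.path.exists(path):
            # spilled entry: reload from disk and promote back to memory
            with open(path, "rb") as f:
                value = pickle.load(f)
            os.remove(path)
            self.put(key, value)
            return value
        raise KeyError(key)
```

In the repeated-reads workload above, a cache like this keeps hot segments in memory while cold ones fall back to (still faster-than-S3) local disk; for the single-sweep workload, caching buys you nothing and you'd skip it entirely.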
Would love to see more cache utilities on top of - but not within - Onda.jl to facilitate common access patterns (or better, ones that compose with Onda.jl without needing to explicitly depend on it).
For my usage, annotations wouldn’t be that numerous. My feeling is that if you’re talking about numbers like these (>100000), the annotations are either the output of some kind of algorithm or a signal in their own right. From my point of view, neither should be handled as annotations.
Yup, this is how we handle it; for us, at least, the natural thing is to treat “dense annotations” as signals in their own right.
with additional event dictionary
There’s no canonical specification of this in the format, though you could roll your own however you’d like. Defining a canonical spec for this might not be a bad idea, though - this is what I meant by “perhaps it wouldn’t be a bad idea for the format to define a structure for categorical `sample_unit`s…”.
Onda is a layer above already stored files of different formats
Yes, as well as a format for structuring signal metadata + a data model that allows you to treat all those files similarly, as LPCM signals encoded in their own ways.
Do you use some file database with tagging / multiple grouping functionality?
Right now we’re just using S3 for object storage; ingest of new Onda datasets is then just a matter of shoving the metadata into a database (that indexes the S3 objects) and the sample data into S3.
Or is it compatible with HDF5 and can it be saved in it?
An Onda dataset is just a directory with a fairly simple structure, so AFAICT there’s no reason it couldn’t be saved in HDF5. The comparison paragraph linked above is intended to explain why Onda as a format isn’t defined on top of HDF5, but I should update it to make clear that Onda isn’t incompatible with HDF5 (purposefully, anyway). EDIT: see beacon-biosignals/OndaFormat#12 (“clarify Onda’s relationship with HDF5”).