Thanks for the interest and thoughtful questions!
You mention that it’s suitable for recordings where at least one signal fits into memory.
Ah, actually, these days the format is indeed suitable for larger signals. I’ve opened a PR to update the “useful for” list.
What are your thoughts on utilizing mmap for larger files?
Totally kosher if the format is raw .lpcm. Actually doing so is a bit manual, however… it probably wouldn’t be a bad idea to add an mmap option to the load interface:
using Onda, Mmap
include(joinpath(dirname(dirname(pathof(Onda))), "examples", "tour.jl"))
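# open the raw sample data file that backs the EEG signal
file = open(samples_path(dataset, uuid, :eeg))
S = eeg_signal.sample_type
nrows = channel_count(eeg_signal)
# infer the number of multichannel samples (columns) from the file size
ncols = Int(filesize(file) / sizeof(S) / nrows)
# wrap the memory-mapped matrix in a `Samples` object instead of reading it eagerly
samples = Samples(eeg_signal, true, Mmap.mmap(file, Matrix{S}, (nrows, ncols)))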
# now performing normal operations on `Samples` (e.g. decoding only
# a specific region) works the same as it usually does:
decoded_region = decode(view(samples, :, TimeSpan(Second(3), Second(5))))
Note that writing the above made me realize that samples_path wasn’t super ergonomic or exported, so this example works off of a small PR that I just opened to add samples_path to the official API.
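If you want to try the memory-mapping step on its own, here is a minimal sketch that uses only the Mmap standard library and no Onda.jl at all. The sample type, channel count, and temporary file are made up for illustration; only the layout (one column per multichannel sample) mirrors the example above.

```julia
using Mmap

# Hypothetical signal layout (not taken from any real dataset):
# 4 channels of Int16 samples, one column per multichannel sample.
S = Int16
nrows = 4
ncols = 1_000
original = reshape(S.(1:(nrows * ncols)), nrows, ncols)

# Write the raw bytes to a temporary file, mimicking a raw LPCM file.
path = tempname()
write(path, original)

# Memory-map the file; the column count is inferred from the file size.
mapped = open(path) do io
    n = filesize(path) ÷ (sizeof(S) * nrows)
    Mmap.mmap(io, Matrix{S}, (nrows, n))
end

# Indexing copies only the requested columns out of the mapping.
region = mapped[:, 301:500]
```

Since the mapped matrix is an ordinary `Matrix{Int16}`, anything that accepts an `AbstractMatrix` can consume it without knowing the data lives on disk.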
I could find a merged PR for seekable zstd which might be used for channels that don’t fit into memory.
Yeah, I’d love for this to be implemented! Relevant zstd issue here. Ideally support would be implemented in CodecZstd.jl (though I’m not sure if TranscodingStreams’ interface supports random access methods). Adding Onda support would then just be a matter of overloading the appropriate deserialize_lpcm method for the LPCMZst serializer.
If “Onda is not a … file format”, how would you describe it? Is it a directory layout structure with sensible defaults? Or is Onda just the interface specification for a format?
It’s kind of both. It’s a specification for organizing signals/recordings into a file hierarchy alongside a metadata schema that enables multi-sensor recordings to be treated generically as collections of LPCM sample data. The trick is keeping the format as simple as possible so that reader/writer implementations can be made highly amenable to domain-specific specialization, while still enforcing enough standardization that the format provides useful guarantees for downstream ingest/analysis processes/tooling. To this end, a lot of the format’s design work (and trial-by-fire testing) has revolved around tweaking the specification to mandate only enough structure to “get the job done”, and no more.
Of course, it’s forever a work in progress; for example, I’d like to at least resolve (and battle-test) some of the items currently in the issue tracker before considering a 1.0 release.
Concerning the extensibility part of the specification: Would you assume that people write Onda wrappers for their file formats, so that the access pattern demonstrated in the “tour” through Onda.jl can be reused? If so, which methods need to be implemented to allow this?
For a specific example of extending Onda.jl with a new file format, see the FLAC example.
More generally, Onda’s extensibility comes in a few different flavors:
- The mandatory signal encoding parameters should be sufficient to capture most useful LPCM encodings.
- You can basically encode whatever structure you’d like into Onda’s sparse key-value annotations, and dense categorical annotations can simply be treated as new signals (perhaps it wouldn’t be a bad idea for the format to define a structure for categorical sample_units…).
- The recording objects’ custom field allows dataset authors to associate whatever values they like with each recording.
- The aforementioned ability to (de)serialize signal data to/from arbitrary file formats. Granted, the specification provides little aid to implementations in the way of an actual mechanism by which new (de)serialization methods should be incorporated; it just requires that it’s possible. For now, the onus is essentially on the dataset author to guarantee to their consumers that included file formats are actually readable. In practice, I think this is a reasonable expectation. Also, at least in the Julia world, you can imagine getting a lot of convenient fallbacks “for free” by e.g. hooking up FileIO.jl to Onda.jl’s (de)serializer interface.
- Arbitrary additional content is allowed to be placed in the .onda directory. While reader/writer implementations can’t necessarily make use of such additional content automatically, nothing precludes the inclusion of code (or a link to code) that provides whatever additional features the dataset author wishes to share.
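To make the (de)serialization point a bit more concrete, here is a rough sketch of how a pluggable serializer interface can be built on Julia’s multiple dispatch. To be clear, every name in it (ToySerializer, HostOrderLPCM, BigEndianLPCM, toy_serialize, toy_deserialize, TOY_SERIALIZERS, toy_load) is hypothetical and invented for this illustration; it is not Onda.jl’s actual interface, just the general shape such an extension mechanism might take.

```julia
# NOTE: all names here are hypothetical; this is not Onda.jl's API.
abstract type ToySerializer end

# Samples stored as raw bytes in the host's native byte order.
struct HostOrderLPCM <: ToySerializer end

# Samples stored as raw bytes in big-endian (network) byte order.
struct BigEndianLPCM <: ToySerializer end

# Each format adds methods to the same two generic functions.
toy_serialize(::HostOrderLPCM, samples::Matrix{Int16}) =
    collect(reinterpret(UInt8, vec(samples)))

toy_deserialize(::HostOrderLPCM, bytes::Vector{UInt8}, nchannels::Int) =
    reshape(collect(reinterpret(Int16, bytes)), nchannels, :)

# The big-endian format reuses the host-order methods and swaps bytes as needed.
toy_serialize(::BigEndianLPCM, samples::Matrix{Int16}) =
    toy_serialize(HostOrderLPCM(), hton.(samples))

toy_deserialize(::BigEndianLPCM, bytes::Vector{UInt8}, nchannels::Int) =
    ntoh.(toy_deserialize(HostOrderLPCM(), bytes, nchannels))

# A dataset author could ship a lookup like this alongside their data.
const TOY_SERIALIZERS = Dict{String,ToySerializer}(
    "lpcm" => HostOrderLPCM(),
    "lpcm.be" => BigEndianLPCM(),
)

# Readers dispatch on the file extension without knowing format details.
toy_load(ext::String, bytes::Vector{UInt8}, nchannels::Int) =
    toy_deserialize(TOY_SERIALIZERS[ext], bytes, nchannels)

# Round-trip a tiny 2-channel signal through both formats.
samples = Int16[1 -2 3; 4 5 -6]
roundtrips = [toy_load(ext, toy_serialize(TOY_SERIALIZERS[ext], samples), 2)
              for ext in ("lpcm", "lpcm.be")]
```

Supporting a new format then amounts to defining one struct and two methods, which is roughly the level of effort I’d hope a dataset author would need to spend.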