[ANN] Onda.jl: A format for multi-sensor, multi-channel, LPCM-encodable recordings

Hello!

I’m pleased to announce Beacon Biosignals’ impending open-source release of both Onda.jl and the Onda format. Onda is a lightweight format for storing and manipulating datasets of multi-sensor, multi-channel, LPCM-encodable, annotated, time-series recordings. At Beacon, we use this format to wrangle datasets with thousands of electrophysiological recordings comprising terabytes of sample data. We’d love to make it useful for you as well!

Note that this isn’t a v1.0 release for the package/format, so don’t expect rock-solid stability just yet. However, I believe the format and accompanying Julia package are stable enough for early adopters to get their feet wet.

As always, please file issues/PRs if you find opportunities for improvement or collaboration!

Best,
Jarrett

20 Likes

Great to see such a package! Here are a couple of questions:

  • You mention that it’s suitable for recordings where at least one signal fits into memory. What are your thoughts on utilizing mmap for larger files? I did find a merged PR for seekable zstd which might be used for channels that don’t fit into memory.
  • If “Onda is not a … file format”, how would you describe it? Is it a directory layout structure with sensible defaults? Or is Onda just the interface specification for a format?
  • Concerning the extensibility part of the specification: Would you assume that people write Onda wrappers for their file formats, so that the access pattern demonstrated in the “tour” through Onda.jl can be reused? If so, which methods need to be implemented to allow this?

Thanks for sharing!

2 Likes

Thanks for the interest and thoughtful questions :slight_smile:

You mention that it’s suitable for recordings where at least one signal fits into memory.

Ah, actually, these days the format is indeed suitable for larger signals. I’ve opened a PR to update the “useful for” list.

What are your thoughts on utilizing mmap for larger files?

Totally kosher if the format is raw .lpcm. Actually doing so is a bit manual at the moment, however…it probably wouldn’t be a bad idea to add an mmap option to the load interface:

using Onda, Mmap

include(joinpath(dirname(dirname(pathof(Onda))), "examples", "tour.jl"))

file = open(samples_path(dataset, uuid, :eeg))
S = eeg_signal.sample_type
# the raw `.lpcm` file is a channels × samples matrix, so the sample
# count falls directly out of the file size
nrows = channel_count(eeg_signal)
ncols = Int(filesize(file) / sizeof(S) / nrows)
samples = Samples(eeg_signal, true, Mmap.mmap(file, Matrix{S}, (nrows, ncols)))

# now performing normal operations on `Samples` (e.g. decoding only
# a specific region) works the same as it usually does:
decoded_region = decode(view(samples, :, TimeSpan(Second(3), Second(5))))

Note that writing the above made me realize that samples_path wasn’t super ergonomic or exported, so this example works off of a small PR that I just opened to add samples_path to the official API.

I did find a merged PR for seekable zstd which might be used for channels that don’t fit into memory.

Yeah, I’d love for this to be implemented! Relevant zstd issue here. Ideally support would be implemented in CodecZstd.jl (though I’m not sure if TranscodingStreams’ interface supports random access methods). Adding Onda support would then just be a matter of overloading the appropriate deserialize_lpcm method for the LPCMZst serializer.

If “Onda is not a … file format”, how would you describe it? Is it a directory layout structure with sensible defaults? Or is Onda just the interface specification for a format?

It’s kind of both :smiley: It’s a specification for organizing signals/recordings into a file hierarchy alongside a metadata schema that enables multi-sensor recordings to be treated generically as collections of LPCM sample data. The trick is keeping the format as simple as possible so that reader/writer implementations can be made highly amenable to domain-specific specialization, while still enforcing enough standardization that the format provides useful guarantees for downstream ingest/analysis processes/tooling. To this end, a lot of the format’s design work (and trial-by-fire testing) has revolved around tweaking the specification to mandate only enough structure to “get the job done”, and no more.

Of course, it’s forever a work in progress :wink: for example, I’d like to at least resolve (and battle-test) some of the items currently in the issue tracker before considering a 1.0 release.

Concerning the extensibility part of the specification: Would you assume that people write Onda wrappers for their file formats, so that the access pattern demonstrated in the “tour” through Onda.jl can be reused? If so, which methods need to be implemented to allow this?

For a specific example of extending Onda.jl with a new file format, see the FLAC example.

More generally, Onda’s extensibility comes in a few different flavors:

  1. The mandatory signal encoding parameters should be sufficient to capture most useful LPCM encodings.
  2. You can basically encode whatever structure you’d like into Onda’s sparse key-value annotations, and dense categorical annotations can simply be treated as new signals (perhaps it wouldn’t be a bad idea for the format to define a structure for categorical sample_units…).
  3. The recording objects’ custom field allows dataset authors to associate whatever values they like with each recording.
  4. The aforementioned ability to (de)serialize signal data to/from arbitrary file formats. Granted, the specification provides little aid to implementations in the way of an actual mechanism by which new (de)serialization methods should be incorporated; it just requires that it’s possible. For now, the onus is essentially on the dataset author to guarantee to their consumers that included file formats are actually readable. In practice, I think this is a reasonable expectation. Also, at least in the Julia world, you can imagine getting a lot of convenient fallbacks “for free” by e.g. hooking up FileIO.jl to Onda.jl’s (de)serializer interface.
  5. Arbitrary additional content is allowed to be placed in the .onda directory. While reader/writer implementations can’t necessarily make use of such additional content automatically, nothing precludes the inclusion of code (or a link to code) that provides whatever additional features the dataset author wishes to share.
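As a concrete illustration of point 2 above — treating a dense categorical annotation as a signal in its own right — here’s a hypothetical sketch. The `labels` dictionary plays the role of the “structure for categorical sample_units” mused about above; none of this is mandated by the Onda specification:

```julia
# Hypothetical sketch: a dense categorical annotation (e.g. sleep stages)
# stored as an integer-coded, LPCM-encodable sample stream plus a label
# dictionary mapping codes back to human-readable labels.
struct CategoricalChannel
    labels::Dict{UInt8,String}   # code => human-readable label
    codes::Vector{UInt8}         # one code per sample
end

# decode a span of samples back to their labels
decode_labels(c::CategoricalChannel, span::UnitRange{Int}) =
    [c.labels[code] for code in view(c.codes, span)]

stages = CategoricalChannel(Dict(0x00 => "wake", 0x01 => "n1", 0x02 => "n2"),
                            UInt8[0x00, 0x00, 0x01, 0x02, 0x02])
decode_labels(stages, 2:4)  # ["wake", "n1", "n2"]
```

The integer codes compress and slice exactly like any other LPCM channel, while the label dictionary could live alongside the signal as sparse metadata.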
2 Likes

Ideally support would be implemented in CodecZstd.jl (though I’m not sure if TranscodingStreams’ interface supports random access methods).

(ref random access/seek support · Issue #18 · JuliaIO/CodecZstd.jl · GitHub)

I. Terabytes of data

There is the option of using LRUCache.jl to store the last chunks of data read from big files with random access.

In general, it should be split into three steps:

  1. request range (or interval) needed for processing,
  2. copy that range into cached chunk(s)
  3. copy chunk(s) to requested array, then work on that array.

Why should we split it into steps:

1-2: No need to read the same data multiple times on overlapping requests. (In general: no need to process any intermediate data twice.)

2-3: There is performance overhead if you combine such cache with array interface directly, since every getindex would seek the right chunk and update usage data.

Actually, you can get rid of step 3 and view requested arrays into individual chunks, if your chunk is larger than or equal to the actual requested range. But in that case I’m not sure how to solve the problem of overlapping small and big chunks with different “least usage” counts. Maybe by adding some interval keys from IntervalTrees.jl?
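A minimal sketch of steps 1–3, assuming a hypothetical `read_chunk` backend read and a plain `Dict` standing in for LRUCache.jl’s eviction policy:

```julia
# Sketch of the request → chunk-cache → output-array pipeline described
# above. `read_chunk` stands in for an expensive backend read (steps 1–2);
# `fetch_range` assembles the requested range from cached chunks (step 3),
# so overlapping requests never re-read the same chunk from the backend.
const CHUNK = 1000

backend = collect(1:10_000)          # stand-in for a big file
reads = Ref(0)                       # count backend reads for illustration
read_chunk(i) = (reads[] += 1; backend[((i - 1) * CHUNK + 1):(i * CHUNK)])

cache = Dict{Int,Vector{Int}}()      # LRUCache.jl's LRU would evict here

function fetch_range(range::UnitRange{Int})
    first_chunk = fld(first(range) - 1, CHUNK) + 1
    last_chunk = fld(last(range) - 1, CHUNK) + 1
    buffer = reduce(vcat, (get!(() -> read_chunk(i), cache, i)
                           for i in first_chunk:last_chunk))
    offset = (first_chunk - 1) * CHUNK
    return buffer[(first(range) - offset):(last(range) - offset)]
end

fetch_range(500:1500)   # reads chunks 1 and 2 from the backend
fetch_range(900:1100)   # overlapping request: served entirely from cache
```

Replacing the `Dict` with an LRU keyed by chunk index gives the bounded-memory version; the overlapping-chunk bookkeeping question above is about what happens when cached chunks have differing sizes.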


II. Annotations

  1. What if there are hundreds of thousands of sparse annotations for a long record? Do you support storing annotation arrays in the same data format, but with an additional event dictionary?

  2. What if there are several layers of different annotations? And what if you request some data depending on combinations of annotated events, or generate another annotation layer?

  3. What operations on annotation layers are you planning to support?


III. Format or protocol?

  1. What advantages does it provide compared to HDF5 or similar formats?

  2. Or is it compatible with HDF5 and can it be saved in it?

  3. Or is it a transfer protocol for biosignals?

I’m curious about @jrevels’ take on your points as I think it would help to better understand the envisioned borders and shape of Onda. In the meantime here are my thoughts:

Caching/Access:
Triple level caching/access might be helpful for some cases, but for what I am interested in, two layers feel sufficient (chunking the data from (compressed) disk storage into memory).

Annotations:
For my usage, annotations wouldn’t be that numerous. I have the feeling that if you are talking about numbers like these (>100000), the annotations are either the output of some kind of algorithm or a signal in their own right. From my point of view, both shouldn’t be handled as annotations.

HDF5
There is already a paragraph at https://github.com/beacon-biosignals/OndaFormat describing the reasoning for Onda in comparison to HDF5 (and other formats):

HDF5 was considered as an alternative to filesystem storage for Onda recording metadata and raw signal artifacts. While featureful, ubiquitous, and technically based on an open standard, HDF5 is infamous for being a hefty dependency with a fairly complex reference implementation. While HDF5 solves many problems inherent to filesystem-based storage, most use cases for Onda involve storing large binary blobs in domain-specific formats that already exist quite naturally as files on a filesystem.

I will play around with Onda next week to gain experience on where and how to use it…

The third level appears as soon as you copy, say, 1500 points from cached chunks of 1000 points.

Yes, I’m talking about annotations as one of several signal types, known as labels or segmentation. So, they are indeed the output of some algorithm. Especially if we are talking about terabytes of data, because you cannot manually annotate terabytes in a reasonable amount of time. You can only review a small portion of automatically labeled/annotated data.

It’s not about HDF5 limitations, because there are different similar formats like Arrow, Zarr, Exdir, TileDB, N5, Z5, etc.

If Onda is a layer above already-stored files of different formats (something like a file database), then it is more about mapping different formats to software that can read them, mapping data from files to metadata, and paying special attention to the metadata structures that are added on top of those files. Here are some thoughts on working with metadata: do you use some file database with tagging / multiple-grouping functionality?

When working with data at this scale, computations do indeed usually need some form (and usually multiple forms) of caching!

However, the most appropriate cache depends on your computation/access pattern, which is orthogonal to (or at least sits on top of) Onda itself as a format. At Beacon, for example, we have some workloads that read the same segments repeatedly, but others that sweep through all segments and load each only once. Onda.jl makes it easy/possible to read segments by TimeSpan but (rightfully, IMO) isn’t opinionated w.r.t. caching.

We do have a nice little LRU cache implementation that auto-spills to disk, though - should probably upstream that to LRUCache.jl or somewhere :slight_smile: Would love to see more cache utilities on top of - but not within - Onda.jl to facilitate common access patterns (or better, composing with Onda.jl without needing to explicitly depend on it).
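For illustration, a cache along those lines can be sketched with the stdlib `Serialization` module — hold recent entries in memory and serialize evicted ones to files instead of dropping them. All names here are hypothetical, not the actual Beacon implementation:

```julia
using Serialization

# Sketch of an LRU cache that spills evicted entries to disk instead of
# discarding them, so a later read hits a local file rather than the
# original (possibly remote) source. Recency bookkeeping on cache hits
# is elided for brevity.
mutable struct SpillingLRU{K,V}
    mem::Dict{K,V}
    order::Vector{K}          # least recently inserted first
    maxsize::Int
    dir::String               # spill directory for evicted entries
end

SpillingLRU{K,V}(maxsize) where {K,V} =
    SpillingLRU{K,V}(Dict{K,V}(), K[], maxsize, mktempdir())

spillpath(c::SpillingLRU, key) = joinpath(c.dir, string(hash(key)))

function Base.setindex!(c::SpillingLRU{K,V}, value::V, key::K) where {K,V}
    haskey(c.mem, key) || push!(c.order, key)
    c.mem[key] = value
    while length(c.mem) > c.maxsize        # evict oldest entries to disk
        victim = popfirst!(c.order)
        serialize(spillpath(c, victim), pop!(c.mem, victim))
    end
    return value
end

function Base.getindex(c::SpillingLRU{K,V}, key::K) where {K,V}
    haskey(c.mem, key) && return c.mem[key]
    value = deserialize(spillpath(c, key))::V  # revive from the spill file
    c[key] = value                             # re-admit to memory
    return value
end
```

A real implementation would also track disk usage, clean up stale spill files, and fall back to the original source when a key was never cached.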

For my usage, annotations wouldn’t be that numerous. I have the feeling that if you are talking about numbers like these (>100000), the annotations are either the output of some kind of algorithm or a signal in their own right. From my point of view, both shouldn’t be handled as annotations.

Yup, this is how we treat it; for us, at least, the natural thing is to treat “dense annotations” as signals in their own right.

with additional event dictionary

There’s no canonical specification of this in the format, though you could roll your own however you’d like. Defining a canonical spec for this might not be a bad idea, though - this is what I meant by “perhaps it wouldn’t be a bad idea for the format to define a structure for categorical sample_units…”.

Onda is a layer above already stored files of different formats

Yes, as well as a format for structuring signal metadata plus a data model that allows you to treat all those files uniformly as LPCM signals encoded in their own ways.

Do you use some file database with tagging / multiple grouping functionality?

Right now we’re just using S3 for object storage; ingest of new Onda datasets is then just a matter of shoving the metadata into a database (that indexes the S3 objects) and the sample data into S3.

Or is it compatible with HDF5 and can it be saved in it?

An Onda dataset is just a directory with a fairly simple structure, so AFAICT there’s no reason it couldn’t be saved in HDF5. The comparison paragraph linked above intends to explain why Onda as a format isn’t defined on top of HDF5, but I should update it to make it clear that Onda isn’t incompatible with HDF5 (purposefully, anyway). EDIT ref clarify Onda's relationship with HDF5 by jrevels · Pull Request #12 · beacon-biosignals/OndaFormat · GitHub

1 Like

Are there any plans to incorporate this in some way with BIDSTools.jl? I appreciate that there are a lot of nuances in optimizing data IO and processing but there are so many data formats and variations in the technical details when it comes to neuroscience data that it seems a bit overwhelming to have to learn another one.

No concrete ones, no - though of course I understand the pain of “yet another format” :grimacing:

IMO Onda targets a separate use case, however. IIUC, BIDS is focused on providing an ontology for organizing neuro-specific datasets that map very closely to experimental practice for its target domain, while Onda is focused on providing an intermediate format for ingest/digest of bulk multi-channel/multi-sensor LPCM recording datasets that supports extensions for domain-specific encodings. There’s overlap, but it’s pretty apples-to-oranges in many respects. Importantly, since Onda isn’t tied to e.g. the neuro domain, it purposefully doesn’t have much of an “opinion” on metadata unrelated to (de)serialization of the signal data + annotations.

It definitely is focused on neuroscience research, but they keep expanding. It originally was just imaging, then incorporated EEG, then psychological testing encoding, etc… I like BIDS because it benefits from multiple experts in different labs working together on what info they all find necessary. I think the file system organization is a nice bonus to all of that.

However, BIDS isn’t perfect, and if you can improve upon the situation that would be awesome. It may be worth pointing out that electrophysiology files seem to be pretty flexible under BIDS, and you might be able to propose a drop-in replacement to the standard. It would be nice to see some discussion in their repositories that is centered more around improving performance.

A little off-topic: but what is a good I/O package (if one exists) for EEG/MEG data?

There are several.
https://github.com/beacon-biosignals/EDF.jl
https://github.com/wherrera10/EDFPlus.jl

Electrophysiology has so many file formats though that it’s difficult to just choose one. This is why something like the Onda format may be useful, but they would need to work with a number of communities before it could become a standard.

1 Like