[ANN] OndaBatches.jl: Continuous and distributed batching for Onda-formatted datasets

We (Beacon Biosignals) have recently open-sourced OndaBatches.jl, which encapsulates a set of patterns we’ve found useful for our machine-learning workflows. For high-level context, we’re often working with inconveniently large multi-channel time-series datasets (based on the Onda format, hence the name) paired with dense or sparse labels in the form of “annotations” (defined by the time span they cover). The goal is not to provide a complete plug-and-play solution but rather to make it easier for our machine learning teams to stand up new projects!

In short, this package provides:

  • A set of abstractions for working with a “labeled signal” (an Onda.Signal paired with a collection of labels) in a more ML-friendly way, converting span-based labels into regularly sampled time series aligned with the signal data (see the first sketch after this list)
  • Support for loading only specific byte ranges for an Onda.Signal from S3 (lite type piracy)
  • A set of abstractions for stateless batch iteration, e.g. pseudo-randomly selecting recordings/spans/etc. according to some user-defined weighting (see the second sketch below)
  • Tools for distributing the batch materialization work across a Distributed.jl-based cluster (e.g. addprocs, K8sClusterManagers.jl; see the third sketch below)
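
To make the “labeled signal” idea concrete, here is a minimal, self-contained sketch of the general pattern. This is not OndaBatches' actual API: the `Span` type and `sample_labels` function are made-up names for illustration. The idea is that span-based annotations get expanded into one label per sample at a fixed labeling rate, so the labels line up with the signal data.

```julia
# Sketch only: expand span-based annotations into regularly sampled labels.
# `Span` and `sample_labels` are hypothetical names, not OndaBatches' API.
using Dates

struct Span
    start::Millisecond  # offset from the start of the recording
    stop::Millisecond
    label::Int
end

# `duration` is the recording length; `rate` is labels per second.
function sample_labels(spans::Vector{Span}, duration::Millisecond, rate::Real;
                       default::Int=0)
    n = floor(Int, rate * Dates.value(duration) / 1000)
    labels = fill(default, n)
    for span in spans
        i = clamp(floor(Int, rate * Dates.value(span.start) / 1000) + 1, 1, n)
        j = clamp(ceil(Int, rate * Dates.value(span.stop) / 1000), 1, n)
        labels[i:j] .= span.label
    end
    return labels
end

# a 10 s recording labeled at 1 Hz, with a single annotation from 2 s to 5 s:
sample_labels([Span(Millisecond(2000), Millisecond(5000), 1)], Millisecond(10_000), 1)
# -> [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
```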
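And here is a sketch of what “stateless” batch iteration means in practice: the contents of batch `i` are a pure function of the batch index (plus a seed and the user-supplied weights), so any process can independently materialize any batch without shared iteration state. Again, the names below are made up for illustration rather than taken from the package.

```julia
# Sketch only: stateless, weighted pseudo-random batch selection. The RNG is
# derived from the batch index, so batch `i` is reproducible on any worker.
using Random, StatsBase

function batch_recordings(recordings::Vector{String}, weights::Vector{Float64},
                          batch_index::Integer, batch_size::Integer; seed::Integer=0)
    rng = MersenneTwister(hash((seed, batch_index)))
    return sample(rng, recordings, Weights(weights), batch_size; replace=true)
end

batch_recordings(["rec-a", "rec-b", "rec-c"], [0.7, 0.2, 0.1], 1, 4)
# e.g. ["rec-a", "rec-a", "rec-b", "rec-a"]
```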
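Finally, a rough sketch of the distributed-materialization pattern using only Distributed.jl primitives. This is just an illustration under assumed names: `materialize_batch` here is a stand-in for the real work of loading sample data and labels for a batch.

```julia
using Distributed
addprocs(2)

# stand-in for the real work of loading sample data + labels for batch `i`
@everywhere materialize_batch(i) = (i, rand(Float32, 16, 128))

nbatches = 8
batches = RemoteChannel(() -> Channel{Any}(4))  # bounded buffer gives backpressure
indices = Channel{Int}(nbatches)
foreach(i -> put!(indices, i), 1:nbatches)
close(indices)

# one feeder task per worker: pull the next index, materialize it remotely,
# and push the result into the shared channel
feeders = map(workers()) do w
    @async for i in indices
        put!(batches, remotecall_fetch(materialize_batch, w, i))
    end
end

# the training loop just consumes batches as they become ready (possibly out of order)
for _ in 1:nbatches
    i, x = take!(batches)
    # train!(model, x) ...
end
```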

The last bit is probably the most complex, and it's also something we're optimistic we can get rid of in the long term. We're particularly interested in exploring approaches that use tools like Dagger atop K8s (ahem @jpsamaroo) to obviate or improve this aspect of the package, and/or higher-throughput S3 access (a big motivation for developing this was the difficulty we've had getting enough throughput directly from S3 to saturate our training pipelines).

Why are we open sourcing this? Beyond the basic motivation to share things we've made that might be of broad use or interest, we're hoping that having more eyes on this work will surface better solutions that are already out there! We'd love to hear which other Julia packages might help us solve these kinds of distributed-computing-over-big-blobs-of-sample-data problems at Beacon, so please don't hesitate to reach out (here, Slack, or via email).
