[ANN] OndaBatches.jl: Continuous and distributed batching for Onda-formatted datasets

We (Beacon Biosignals) have recently open-sourced OndaBatches.jl, which encapsulates a set of patterns we’ve found useful for our machine-learning workflows. For high-level context, we’re often working with inconveniently large multi-channel time-series datasets (based on the Onda format, hence the name) paired with dense or sparse labels in the form of “annotations” (defined by the time span they cover). The goal is not to provide a complete plug-and-play solution but rather to make it easier for our machine learning teams to stand up new projects!

In short, this package provides

  • A set of abstractions for working with a “labeled signal” (an Onda.Signal combined with a collection of labels) in a more ML-friendly way, by converting span-based labels into regularly-sampled time series aligned with the signal data
  • Support for loading only specific byte ranges of an Onda.Signal from S3 (implemented with some light type piracy)
  • A set of abstractions for stateless batch iteration (e.g., pseudo-randomly selecting recordings/spans/etc. based on some user-defined weighting)
  • Tools for distributing the batch materialization work across a Distributed.jl-based cluster (e.g., addprocs, K8sClusterManagers.jl, etc.)
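
To make the first and third bullets a bit more concrete, here is a minimal Python sketch of the two ideas. This is not the OndaBatches.jl API: the function names `spans_to_samples` and `pick_recording`, the `(start, stop, label)` tuple layout, and the seed-plus-batch-index scheme are all assumptions made up for illustration.

```python
import random


def spans_to_samples(spans, sample_rate, n_samples, default=None):
    """Turn span-based labels into one label per sample.

    `spans` is a list of (start_sec, stop_sec, label) tuples (hypothetical
    layout). Each span is treated as half-open, [start, stop), and later
    spans overwrite earlier ones where they overlap.
    """
    labels = [default] * n_samples
    for start, stop, label in spans:
        # Map the span boundaries from seconds to sample indices,
        # clamped to the extent of the signal.
        first = max(0, round(start * sample_rate))
        last = min(n_samples, round(stop * sample_rate))
        for i in range(first, last):
            labels[i] = label
    return labels


def pick_recording(weights, seed, batch_index):
    """Weighted pseudo-random choice with no hidden iterator state.

    The RNG is rebuilt from (seed, batch_index) on every call, so the same
    inputs always select the same recording. This is just one possible way
    to keep selection reproducible across workers.
    """
    rng = random.Random(seed * 1_000_003 + batch_index)
    names = sorted(weights)  # fixed order so results do not depend on dict order
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Two sleep-stage-like annotations over a 3-second signal sampled at 2 Hz.
example_labels = spans_to_samples(
    [(0.0, 1.0, "wake"), (1.0, 2.5, "n1")], sample_rate=2, n_samples=6
)
```

Because `pick_recording` depends only on its arguments, any worker can compute which recording belongs to batch `k` without coordinating with the others; only the (expensive) loading of the selected samples needs to be distributed.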

The last bit is probably the most complex, and it is also something we’re optimistic we might be able to get rid of in the long term. We’re particularly interested in exploring approaches that use tools like Dagger atop K8s (ahem @jpsamaroo) to obviate or improve this aspect of the package, as well as higher-throughput S3 access: a big motivation for developing this package was the difficulty we’ve had getting enough throughput directly from S3 to saturate our training pipelines.

Why are we open-sourcing this? Beyond the basic motivation to share the things we’ve made that might be of broad use and/or interest, we’re hoping that having more eyes on this work will reveal better solutions that are out there! We’d love to see which other Julia packages might help us solve these kinds of distributed-computing problems on big blobs of sample data at Beacon, so please don’t hesitate to reach out (here, on Slack, or via email).