I’m excited to make our first public announcement about Banyan Julia - a suite of packages that let you use popular Julia APIs to process massive datasets on and off the cloud (via sampling):
- BanyanDataFrames.jl for DataFrames.jl
- BanyanImages.jl for Images.jl
- BanyanONNXRunTime.jl for ONNXRunTime.jl (for PyTorch/TensorFlow models)
- BanyanHDF5.jl for HDF5.jl
- BanyanArrays.jl for Array
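
To give a sense of the "familiar APIs" claim, here is a small, plain DataFrames.jl example of the kind of code BanyanDataFrames.jl is meant to mirror for much larger datasets. This is a local DataFrames.jl sketch rather than Banyan-specific code, and the file path and column names are made up for illustration:

```julia
using CSV, DataFrames

# Plain DataFrames.jl example of the API style the announcement refers to.
# The file path and column names are purely illustrative.
df = CSV.read("transactions.csv", DataFrame)

# Keep rows with a positive amount, group by region, and total the amounts.
filtered = filter(:amount => >(0), df)
totals = combine(groupby(filtered, :region), :amount => sum => :total)

println(first(totals, 5))
```

The idea is that this same filter/groupby/combine-style code can be pointed at datasets too large for one machine, with Banyan handling the cloud execution or sampling behind the scenes.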
Most recently, we’ve:
- achieved performance comparable to Dask (running on Coiled) in a preliminary benchmark of a common data analytics task
- put together a getting-started walk-through video
- developed automatic instant big data sampling to reduce data teams’ reliance on expensive and energy-intensive cloud data centers
TL;DR: we’re building a platform for eco-friendly, large-scale data science with familiar Julia APIs. More details are on our website - BanyanComputing.com. (PS - it’s a cloud product, so if you want something on-prem, look at Dagger.jl, Distributed, or MPI.jl.)
PPS - I want to thank the friendly and helpful Julia community, including the contributors to DataFrames.jl, Images.jl, ONNXRunTime.jl, and more. Without them, this project would not be possible.