Today Julia Computing is excited to announce JuliaDB.jl (https://github.com/JuliaComputing/JuliaDB.jl), a package for working with large persistent data sets. It is still at a fairly early stage, but we wanted to release it as soon as we had meaningful functionality.
JuliaDB ties together several existing packages, including Dagger.jl and IndexedTables.jl. You can feed it a pile of CSV files, and it will (1) build and save an index of the contents of those files, and (2) optionally “ingest” the data, converting it to a more efficient mmap-able file format. From there, you can open and operate on the dataset, and the package will handle loading only the necessary blocks from disk and storing results back, as in the sketch below. This works with Julia’s distributed parallelism, and also supports out-of-core computation via Dagger.
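To make this concrete, here is a minimal sketch of that workflow. The `loadfiles`, `ingest`, and `load` entry points follow the package README at the time of this release; the paths and the `indexcols` keyword argument are illustrative assumptions, not a fixed API.

```julia
# Start worker processes first (e.g. addprocs(4)) so JuliaDB can
# parallelize loading and computation across them.
using JuliaDB

# (1) Build and save an index over a directory of CSV files.
# The directory and index columns are made up for illustration.
t = loadfiles("data/csvs", indexcols = ["ticker", "date"])

# (2) Optionally ingest the data into the more efficient
# mmap-able binary format, written under a new directory:
t = ingest("data/csvs", "data/ingested")

# Later sessions can reopen the ingested dataset directly;
# blocks are then read from disk only as computations need them.
t = load("data/ingested")
```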
We saw a need for an end-to-end, all-Julia data analysis platform incorporating storage, parallelism, and compute into a single model. We hope this package can eventually become a standard choice for managing persistent array and tabular data for Julia users. To get things started, our focus so far has been on multi-file tabular datasets, especially time series. However, we are designing the system around a general index space model, making it possible to handle both dense and sparse data of any size and dimensionality, addressed by meaningful indices rather than file names (see the example below).
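As a small illustration of what “meaningful indices” means here, consider an in-memory table from IndexedTables.jl, which JuliaDB builds on. The `ndsparse` constructor shown is from a later iteration of that package’s API, and the tickers and prices are invented; the point is only that rows are addressed by index values, not by which file they came from.

```julia
using IndexedTables, Dates

# An N-dimensional sparse table indexed by (ticker, date):
t = ndsparse((ticker = ["AAPL", "GOOG", "GOOG"],
              date   = Date.(["2017-05-01", "2017-05-01", "2017-05-02"])),
             (close  = [146.58, 912.57, 916.44],))

# Point lookup by index values rather than file names:
t["GOOG", Date(2017, 5, 2)]   # -> (close = 916.44,)
```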
We look forward to collaborating with everybody to realize this goal.