Today Julia Computing is excited to announce JuliaDB.jl (https://github.com/JuliaComputing/JuliaDB.jl), a package for working with large persistent data sets. It is still at a fairly early stage, but we wanted to release it as soon as we had meaningful functionality.
JuliaDB ties together several existing packages, including Dagger.jl and IndexedTables.jl. You can feed it a pile of CSV files, and it will (1) build and save an index of the contents of those files, and (2) optionally “ingest” the data, converting it to a more efficient mmap-able file format. From there, you can open and operate on the dataset, and the package will handle loading only the necessary blocks from disk and storing them back as needed. This works with Julia’s distributed parallelism, and also supports out-of-core computation via Dagger. A sketch of the intended workflow is below.
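Roughly, the load/ingest workflow looks like the following. This is a minimal sketch: the entry points (loadfiles, ingest, load) follow the package README at the time of writing and may evolve, and the file layout and column names are made up for illustration.

```julia
using JuliaDB

# Hypothetical file layout; :date and :ticker are illustrative column names.
files = [joinpath("data", f) for f in readdir("data") if endswith(f, ".csv")]

# (1) Build and save an index over the CSV files; blocks are loaded lazily.
t = loadfiles(files, indexcols = [:date, :ticker])

# (2) Or ingest the data into an mmap-able binary format for faster reuse.
t = ingest(files, "data/ingested", indexcols = [:date, :ticker])

# Later sessions can reopen the ingested dataset without reparsing the CSVs.
t = load("data/ingested")
```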
We saw a need for an end-to-end, all-Julia data analysis platform incorporating storage, parallelism, and compute into a single model. We hope this package can eventually become a standard choice for managing persistent array and tabular data for Julia users. To get things started, our focus so far has been on multi-file tabular datasets, especially time series. However, we are trying to design the system to use a general index space model, making it possible to handle both dense and sparse data of any size and dimensions, working only with meaningful indices instead of file names.
We look forward to collaborating with everybody to realize this goal.
Will this package be able to deal with missing data?
Yes. The data columns can contain missing values, but the index columns have to be non-null. By default it will detect columns containing nulls and load them as NullableArrays, though that may change in the future to Vector{Union{Void,T}}.
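As a rough illustration (the constructor details come from IndexedTables.jl and may differ slightly from the released API; the column names are made up): an index column is a plain vector, while a data column with missing values can be a NullableArray.

```julia
using IndexedTables, NullableArrays

# Index column: a plain vector, no nulls allowed.
dates = [20170101, 20170102, 20170103]

# Data column: the second entry is marked missing via the null mask.
prices = NullableArray([1.5, 2.0, 3.0], [false, true, false])

t = IndexedTable(Columns(date = dates), prices)
```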
This definitely looks useful and interesting. It sounds as if this is intended to replace all the DataFrames/DataTables/NullableArrays etc. functionality – is that so? In that case it would be very nice to have some info, a blog post etc., explaining how the package deals with all the different issues and discussions there have been on this topic, e.g. the “DataTables or DataFrames?” question and https://github.com/JuliaStats/DataFrames.jl/issues/1092
This was written to solve a particular need (fast analytics on out-of-core datasets), and in doing so takes a certain design path – using indexing and Dagger. So I doubt that there is an intention to replace anything, just an expectation that this will be useful to many users.
It looks very interesting! I’d also like to see how this compares with DataFrames. Something that I’m curious about and I’m not sure I’ve understood is whether this data structure will provide its own optimised ways of doing general data manipulation (e.g. by, groupby, join, etc.) that take advantage of the nature of the indexing system, or whether the user should rely on external packages (e.g. Query.jl) for that.
Just to second @avik, I took a look, and this does not at all look like a generic package meant to handle all data situations in Julia that would replace things like DataFrames. I think it looks fantastic, but it also appears to target a very specific use case. I’m not sure the name helps here; maybe at least add the word “distributed” somehow, since that seems to be the core idea?
I’ve got a very crude integration with IterableTables.jl ready, and that will integrate this with Query.jl. BUT, that integration will not use all the cool things in JuliaDB at all, i.e. it is a pretty crappy integration. Query.jl is actually set up so that, in theory, specific data sources can provide their own implementations of the query operators and, for example, make use of any indices they have to provide much faster implementations than the default iterator-based implementation in the Query.jl package itself. So at least in theory it should actually be feasible to provide an integration of Query.jl with JuliaDB.jl where one writes standard Query.jl queries, and under the hood they use the fast, optimized querying functions that JuliaDB provides. Having said that, that is the theory, and it would probably be a fair bit of work to pull it off.
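For what the crude baseline looks like in practice, here is a sketch of a standard Query.jl query against a JuliaDB table. The dataset path and the date/price columns are hypothetical, and this goes through Query.jl's generic iterator machinery rather than JuliaDB's indexed operations.

```julia
using JuliaDB, IterableTables, Query, DataFrames

t = load("data/ingested")   # hypothetical ingested dataset

q = @from row in t begin
        @where  row.price > 100.0
        @select {row.date, row.price}
        @collect DataFrame
    end
```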
We’re working on benchmarks; some will be posted fairly soon.
It would definitely be good to support more file formats, especially feather and parquet. So far we get a small amount of compression from PooledArrays (for columns with few unique values), but this is also something we’ll keep working on.
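For context, the PooledArrays compression mentioned above amounts to something like the following sketch (column contents are made up): repeated values are stored once in a pool, and each entry becomes a small integer reference into it.

```julia
using PooledArrays

# A column with few unique values: entries become small integer
# references into a shared pool instead of repeated strings.
tickers = PooledArray(["AAPL", "MSFT", "AAPL", "AAPL", "MSFT"])

length(tickers.pool)   # 2 unique values backing 5 entries
```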
Does fst exist as a file storage format outside of the R package http://www.fstpackage.org/? I got the impression from the documentation that the file storage format is not documented other than in the code and that it is subject to change. Specifically, the page states:
“Note to users: The binary format used for data storage by the package (the ‘fst file format’) is expected to evolve in the coming months. Therefore, fst should not be used for long-term data storage.”
Yes, we can only fairly compare single-process performance. The idea is to get within a reasonable factor of pandas’ speed, including whatever overhead comes from wrapping IndexedTables.jl with Dagger.jl’s scheduler on a single process, and then demonstrate speedups over single-process performance when using many processes.