ANN: JuliaDB.jl

jeff.bezanson · May 6, 2017, 7:23pm

Today Julia Computing is excited to announce JuliaDB.jl (https://github.com/JuliaComputing/JuliaDB.jl), a package for working with large persistent data sets. It is still at a fairly early stage, but we wanted to release it as soon as we had meaningful functionality.

JuliaDB ties together several existing packages, including Dagger.jl and IndexedTables.jl. You can feed it a pile of CSV files, and it will (1) build and save an index of the contents of those files, and (2) optionally “ingest” the data, which converts it to a more efficient mmap-able file format. From there, you can open and operate on a dataset, and the package will handle loading and storing only the necessary blocks from and to disk. This works with Julia’s distributed parallelism, and also supports out-of-core computation via Dagger.
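
As a rough illustration of the index-then-load idea, here is a minimal sketch in Python. It is not JuliaDB's API or implementation: the file names, the `date` key column, and the helper functions `build_index` and `chunks_for_range` are all hypothetical. It only shows how a small per-file index of key ranges lets a query pick out the files it actually needs.

```python
import csv
import io

def build_index(chunks):
    """Record the (min, max) key of each chunk; the rows themselves are not kept."""
    index = []
    for name, text in chunks.items():
        keys = [row["date"] for row in csv.DictReader(io.StringIO(text))]
        index.append((name, min(keys), max(keys)))
    return index

def chunks_for_range(index, lo, hi):
    """Return the names of chunks whose key range overlaps [lo, hi]."""
    return [name for name, kmin, kmax in index if kmin <= hi and kmax >= lo]

# Hypothetical in-memory stand-ins for CSV files on disk.
chunks = {
    "jan.csv": "date,price\n2017-01-03,10.0\n2017-01-20,11.5\n",
    "feb.csv": "date,price\n2017-02-01,12.0\n2017-02-15,12.5\n",
    "mar.csv": "date,price\n2017-03-02,13.0\n2017-03-30,12.8\n",
}

index = build_index(chunks)

# ISO-formatted dates compare correctly as strings, so no date parsing is needed here.
needed = chunks_for_range(index, "2017-02-10", "2017-03-05")
print(needed)
```

Only the chunks listed in `needed` would have to be read from disk to answer a query over that date range; the rest can stay untouched.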

We saw a need for an end-to-end, all-Julia data analysis platform incorporating storage, parallelism, and compute into a single model. We hope this package can eventually become a standard choice for managing persistent array and tabular data for Julia users. To get things started, our focus so far has been on multi-file tabular datasets, especially time series. However, we are trying to design the system to use a general index space model, making it possible to handle both dense and sparse data of any size and dimensions, working only with meaningful indices instead of file names.

We look forward to collaborating with everybody to realize this goal.

catawbasam · May 7, 2017, 8:58pm

Thanks for opening this up. Looks interesting.

jrklasen · May 8, 2017, 4:35am

will this package be able to deal with missing data?

shashi · May 8, 2017, 5:32am

will this package be able to deal with missing data?

Yes. The data columns can contain missing values, but the index columns have to be non-null. By default it detects columns containing nulls and loads them as NullableArrays, but that may change in the future to Vector{Union{Void,T}}.

mkborregaard · May 8, 2017, 7:42am

This definitely looks useful and interesting. It sounds as if this is intended to replace all the DataFrames/DataTables/NullableArrays etc. functionality – is that so? In that case it would be very nice to have some info, a blog post etc., explaining how the package deals with all the different issues and discussions there have been on this topic, e.g. DataTables or DataFrames? and https://github.com/JuliaStats/DataFrames.jl/issues/1092

avik · May 8, 2017, 9:12am

This was written to solve a particular need (fast analytics on out of core datasets), and in doing so takes a certain design path – using indexing and Dagger. So I doubt that there is an intention to replace anything, just an expectation that this is useful to many users.

piever · May 8, 2017, 9:13am

It looks very interesting! I’d also like to see how this compares with DataFrames. Something that I’m curious about and I’m not sure I’ve understood is whether this data structure will provide its own optimised ways of doing general data manipulation (e.g. by, groupby, join, etc.) that take advantage of the nature of the indexing system, or whether the user should rely on external packages (e.g. Query.jl) for that.

mkborregaard · May 8, 2017, 9:18am

OK, thanks for clarifying.

davidanthoff · May 8, 2017, 5:02pm

Just to second @avik, I took a look, and this does not at all look like a generic package meant to handle all data situations in Julia and replace things like DataFrames. I think it looks fantastic, but it also appears to target a very specific use case. I’m not sure the name helps here; maybe at least add the word “distributed” somehow, since that really seems to be the core idea here?

I’ve got a very crude integration with IterableTables.jl ready, and that will integrate this with Query.jl. BUT, that integration will not use all the cool things in JuliaDB at all, i.e. it is a pretty crappy integration. Query.jl is actually set up so that, in theory, specific data sources can provide their own implementation of the query operators and, for example, use any indices they have to provide much faster implementations than the default iterator-based implementation in the Query.jl package itself. So at least in theory it should be feasible to provide an integration of Query.jl with JuliaDB.jl where one writes standard Query.jl queries, and under the hood they use the fast, optimized querying functions that JuliaDB provides. Having said that, that is the theory, and it would probably be a fair bit of work to pull this off.

ChrisRackauckas · May 8, 2017, 5:10pm

That’s why it’s called Julia Distributed Bytes

mbauman · May 11, 2017, 5:39pm

44 posts were split to a new topic: The naming of JuliaDB.jl

jeff.bezanson · May 8, 2017, 5:51pm

We’re working on benchmarks; some will be posted fairly soon.

It would definitely be good to support more file formats, especially feather and parquet. So far we get a small amount of compression from PooledArrays (for columns with few unique values), but this is also something we’ll keep working on.
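
To make the pooled-column idea concrete, here is a minimal sketch in Python. It is not how PooledArrays.jl is implemented; the functions `pool_encode` and `pool_decode` and the sample column are hypothetical. The general technique is to store each distinct value once and replace the column with small integer codes.

```python
def pool_encode(values):
    """Split a column into a pool of unique values and one integer code per row."""
    pool = []        # unique values, in first-seen order
    positions = {}   # value -> its index in the pool
    codes = []
    for v in values:
        if v not in positions:
            positions[v] = len(pool)
            pool.append(v)
        codes.append(positions[v])
    return pool, codes

def pool_decode(pool, codes):
    """Rebuild the original column from the pool and the codes."""
    return [pool[c] for c in codes]

# Hypothetical column with few unique values.
column = ["AAPL", "MSFT", "AAPL", "AAPL", "MSFT", "GOOG"]
pool, codes = pool_encode(column)
print(pool, codes)
```

The saving comes from the codes: with few unique values, each row needs only a small integer instead of a full copy of the value, which is why the gain is limited to columns of that kind.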

Juan · May 8, 2017, 6:02pm

Don’t forget fst and feather.

Any non-official previous results?

shashi · May 8, 2017, 6:08pm

Don’t forget fst and feather.

Fair enough, but they correspond only to a part of JuliaDB.jl’s functionality, namely serialization and deserialization.

dmbates · May 8, 2017, 6:14pm

Does fst exist as a file storage format outside of the R package http://www.fstpackage.org/? I got the impression from the documentation that the file storage format is not documented other than in the code and that it is subject to change. Specifically, the page states:

Note to users: The binary format used for data storage by the package (the ‘fst file format’) is expected to evolve in the coming months. Therefore, fst should not be used for long-term data storage.

davidanthoff · May 8, 2017, 6:22pm

Comparing IndexedTables.jl with pandas seems to make a lot of sense, but the stuff in JuliaDB.jl seems quite different, right?

jeff.bezanson · May 8, 2017, 6:28pm

Currently the way JuliaDB handles distributed datasets and storage is pretty tied to the index concept, so it’s related.

I would be fine with a non-descriptive name if we can think of a good one.

shashi · May 8, 2017, 6:34pm

Yes, we can only fairly compare single-process performance. The idea is to get to within reasonable speed of pandas, including whatever overhead comes from wrapping IndexedTables.jl with Dagger.jl’s scheduler on a single process, and then demonstrate some speedups over single-process performance when using many processes.

Juan · May 8, 2017, 6:48pm

Why single-core?
Do you mean single core or single thread?
