ANN: JuliaDB.jl

We’re working on benchmarks; some will be posted fairly soon.

It would definitely be good to support more file formats, especially feather and parquet. So far we get a small amount of compression from PooledArrays (for columns with few unique values), but this is also something we’ll keep working on.
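
To illustrate the idea, here’s a minimal sketch using PooledArrays.jl directly (the column contents are made up):

```julia
using PooledArrays

# A column with few distinct values is stored as a small pool of values
# plus a vector of compact integer references into that pool.
states = PooledArray(rand(["CA", "NY", "TX"], 10^6))

Base.summarysize(states)           # pool + 10^6 small integer refs
Base.summarysize(collect(states))  # the same data as a plain Vector{String}
```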

Don’t forget fst and feather.

Are there any unofficial results so far?

Fair enough, but fst and feather correspond only to a part of JuliaDB.jl’s functionality, namely serialization and deserialization.

Does fst exist as a file storage format outside of the R package (http://www.fstpackage.org/)? I got the impression from the documentation that the storage format is documented only in the code and is subject to change. Specifically, the page states:

Note to users: The binary format used for data storage by the package (the ‘fst file format’) is expected to evolve in the coming months. Therefore, fst should not be used for long-term data storage.

Comparing IndexedTables.jl with pandas seems to make a lot of sense, but the stuff in JuliaDB.jl seems quite different, right?

Currently the way JuliaDB handles distributed datasets and storage is pretty tied to the index concept, so it’s related.
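
For anyone following along, here’s a minimal sketch of the index concept using the IndexedTables `ndsparse` constructor (constructor names may differ between releases):

```julia
using IndexedTables

# An NDSparse table maps sorted index columns to data columns;
# distributed tables are partitioned along this index.
t = ndsparse((city = ["Boston", "NYC", "NYC"],
              day  = [1, 1, 2]),
             (temp = [58, 66, 70],))

t["NYC", 2]   # lookup by index; returns (temp = 70)
```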

I would be fine with a non-descriptive name if we can think of a good one.

Yes, we can only fairly compare single-process performance. The idea is to get within a reasonable factor of pandas’s speed, including whatever overhead comes from wrapping IndexedTables.jl with Dagger.jl’s scheduler on a single process, and then demonstrate speedups over single-process performance when using many processes.
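
Roughly, the measurement would look like this (a generic sketch with a stand-in workload, not the actual benchmark code):

```julia
using Distributed

work(i) = sum(sin, 1:i)   # stand-in for the real per-chunk workload

# Single-process baseline:
@time sum(work, 1:2_000)

addprocs(4)
@everywhere work(i) = sum(sin, 1:i)

# The same reduction spread across worker processes:
@time @distributed (+) for i in 1:2_000
    work(i)
end
```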

Why single-core?
Do you mean single core or single thread?

Because I believe pandas, for example, doesn’t come with built-in multiprocessing or multithreading.

Could somebody please clarify what “distributed datasets” means in this context? Coming from Hadoop and distributed databases, I understand it as a set of files stored on multiple machines, with the ability to run code or a query locally without copying data over the network. However, in the descriptions of both JuliaDB.jl and Dagger.jl I can see only examples of loading data from a local disk and perhaps copying it to other machines for processing.

In other words, is JuliaDB.jl more similar to DataTables or to Hadoop?

It is a fully distributed model; it should be possible to use it with files stored on multiple machines. Also, we don’t load the data and then copy it to a remote machine in two steps; rather, the remote machine that will handle a given chunk does the loading itself.
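
To illustrate the pattern with plain Distributed primitives (hostnames and file paths here are hypothetical; JuliaDB wraps this kind of logic internally):

```julia
using Distributed

# Hypothetical hosts, each holding one chunk of the data on its local disk.
addprocs([("node1", 1), ("node2", 1)])
@everywhere using CSV, DataFrames

# Each worker reads its own local chunk; only a small summary
# (here a row count) travels back over the network.
counts = map(workers()) do p
    remotecall_fetch(p) do
        df = CSV.read("/data/chunks/part-$(myid()).csv", DataFrame)
        nrow(df)
    end
end
```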

Are there any examples of such a workflow? So far it sounds like it may become a huge thing for the machine learning / big data community, because existing systems like Apache Spark aren’t really well-suited for scientific computation.

Because they don’t have n-dimensional distributed arrays?

Why so much interest in Python’s pandas?

They have distributed matrices, if that’s what you mean; I’m not sure about higher-dimensional arrays, though. But there are many other issues, such as:

  1. Java and Scala aren’t really well-suited for data science or machine learning: too verbose, too few libraries, and too complicated for most users to extend. The Python and R interfaces impose a serious performance loss.
  2. Due to the nature of HDFS, you cannot control data distribution, e.g. you cannot colocate related data and exploit local optimizations.
  3. Data on HDFS cannot be modified in place. If you want to modify data, you have to write it to a different location.
  4. Asynchronous processing is not supported. For example, SGD in Spark is implemented as synchronous iterations over a set of partitions, each processed by a separate worker. If one partition is significantly larger than the others, or one worker is significantly slower, all the other workers wait for it instead of taking the next partition (as, e.g., Google’s Downpour SGD does); see the sketch after this list.
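
To make point 4 concrete, here is a rough sketch of the asynchronous, pull-based pattern in plain Julia (a stand-in workload; this is neither Spark nor JuliaDB code):

```julia
using Distributed
addprocs(4)

@everywhere process(chunk) = sum(sin, 1:chunk)   # stand-in per-partition work

# A shared work queue: fast workers simply pull the next partition
# instead of idling behind a straggler.
const jobs    = RemoteChannel(() -> Channel{Int}(32))
const results = RemoteChannel(() -> Channel{Float64}(32))

@everywhere function worker_loop(jobs, results)
    while true
        chunk = try
            take!(jobs)          # throws once the queue is closed and drained
        catch
            break
        end
        put!(results, process(chunk))
    end
end

foreach(p -> remote_do(worker_loop, p, jobs, results), workers())

foreach(i -> put!(jobs, i * 10^5), 1:16)
close(jobs)
total = sum(take!(results) for _ in 1:16)
```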

If JuliaDB.jl can handle terabyte-sized datasets without imposing Spark’s and HDFS’s limitations, it can become a very attractive framework for a much larger audience than just Julia users.

Could I ask you to say a little more about the specific use case? I saw this line

and the docs’ “Work with data, the Julia way”, which had me thinking this was a replacement for DataFrames. I realize now that the keyword was “persistent”, but I’m still not completely sure where it slots into the data ecosystem. Is it the ability to work with data on several cores at once that sets it apart, or is this a type of database? I read the docs (of course), but they didn’t fully clear this up for me.

Sorry for the naive question, but as a biologist (rather than a programmer) I feel like I am missing something here that is obvious to the other commenters on the thread. My rationale for asking is that I do work a lot with data and I’m thus very interested in the DataFrames/DataTables/IterableTables/IndexedTables debate.

Can it read gzipped files somehow? I have some datasets for which this would be ideal; they are stored compressed (gzip -9) on my hard drive, and uncompressed they would take 10x the space. But possibly the mmap-based design precludes this.

https://github.com/JuliaComputing/JuliaDB.jl/issues/48
I’m working on it right now, Tamas.
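
In the meantime, a workaround outside JuliaDB is to decompress on the fly, e.g. with CodecZlib.jl and CSV.jl (a sketch; these package choices are just one option):

```julia
using CodecZlib, CSV, DataFrames

# Stream-decompress the gzipped file; the uncompressed data never
# touches the disk.
df = open("data.csv.gz") do io
    CSV.read(GzipDecompressorStream(io), DataFrame)
end
```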

Do you guys know how this compares to Anaconda’s Blaze (https://docs.continuum.io/docs_oss/blaze/)? Are the long-term objectives of the projects similar? I know nothing about distributed datasets, so I apologize if this is a silly question.

Can this naming discussion be split out in its own thread please?
