ANN: JuliaDB.jl

shashi · May 8, 2017, 7:28pm

Because I believe pandas, for example, doesn’t come with multi-processing/multi-threading.

dfdx · May 8, 2017, 8:00pm

Could somebody please clarify what “distributed datasets” means in this context? Coming from Hadoop and distributed databases, I understand it as a set of files stored on multiple machines with the ability to run particular code or a query locally without copying data over a network. However, in the descriptions of both JuliaDB.jl and Dagger.jl I can see only examples of loading data from a local disk and maybe copying it to other machines for processing.

In other words, is JuliaDB.jl more similar to DataTables or to Hadoop?

jeff.bezanson · May 8, 2017, 8:15pm

It is a fully distributed model; it should be possible to use it with files stored on multiple machines. Also, we don’t load the data and then copy it to a remote machine in two steps; rather, the remote machine that will handle a certain chunk does the loading itself.
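The "each worker loads its own chunk" pattern described above can be sketched roughly as follows. This is only an illustration in Python, not JuliaDB's API: the function names `load_and_count` and `count_rows` are hypothetical, and a local thread pool stands in for remote machines. The point is that the coordinator hands out file paths and receives small summaries back, so raw data never passes through it.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def load_and_count(path):
    """Runs on the worker: open the chunk locally and return only a small summary."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def count_rows(paths, n_workers=2):
    """The coordinator hands out file paths, never file contents."""
    # Assumption: threads stand in for remote workers in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(load_and_count, paths))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(3):
            p = os.path.join(d, f"chunk{i}.csv")
            with open(p, "w", newline="") as f:
                f.write("id,value\n" + "".join(f"{j},{j * i}\n" for j in range(4)))
            paths.append(p)
        print(count_rows(paths))  # 3 chunks of 4 data rows each
```

In a real cluster the paths would refer to files that are local to each worker, which is what avoids the load-then-copy step.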

dfdx · May 8, 2017, 8:27pm

Are there any examples of such a workflow? So far it sounds like it may become a huge thing for the machine learning / big data community, because existing systems like Apache Spark aren’t really well-suited for scientific computations.

datnamer · May 8, 2017, 8:54pm

Because they don’t have n-dim distributed arrays?

Juan · May 8, 2017, 11:01pm

Why so much interest in python’s pandas?

dfdx · May 8, 2017, 11:03pm

They have distributed matrices if this is what you mean. Not sure about higher-order arrays, though. But there are many other issues such as:

Java and Scala aren’t really well-suited for data science or machine learning - too verbose, too few libraries, too complicated to extend for most users. Python and R impose a serious performance loss.
Due to the nature of HDFS, you cannot control data distribution, e.g. you cannot collocate data and use local optimizations.
Data on HDFS cannot be modified. If you want to modify data, you have to save it to a different location.
Asynchronous processing is not supported. For example, SGD in Spark is implemented as synchronous iterations over a set of partitions, each being processed by a separate worker. If one of the partitions is significantly larger than the others, or if one worker is significantly slower, all the other workers will wait for it instead of taking the next (as e.g. Google’s Downpour SGD does), etc.

If JuliaDB.jl can handle terabyte-sized datasets without imposing Spark’s and HDFS’s limitations, it can become a very attractive framework for a much larger audience than Julia users.
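The straggler effect in the last point of the list above can be illustrated with a small scheduling simulation. This is a toy model under stated assumptions, not Spark's or Downpour SGD's actual implementation: tasks have fixed known costs, the function names are hypothetical, and "synchronous" means that a batch of tasks (one per worker) must finish completely before the next batch starts.

```python
import heapq


def synchronous_makespan(task_costs, n_workers):
    """Tasks run in rounds of n_workers; a round ends when its slowest task ends."""
    total = 0.0
    for start in range(0, len(task_costs), n_workers):
        total += max(task_costs[start:start + n_workers])
    return total


def asynchronous_makespan(task_costs, n_workers):
    """Each worker takes the next task as soon as it becomes free."""
    free_at = [0.0] * n_workers  # min-heap of times at which each worker is idle
    heapq.heapify(free_at)
    for cost in task_costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


if __name__ == "__main__":
    # Two rounds of four partitions; one partition per round is 4x larger.
    costs = [4, 1, 1, 1, 4, 1, 1, 1]
    print(synchronous_makespan(costs, 4))   # 8.0
    print(asynchronous_makespan(costs, 4))  # 5.0
```

With a barrier after each round, the three fast workers sit idle while the large partition finishes; without it, they pick up the remaining work in the meantime.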

mkborregaard · May 9, 2017, 12:14pm

Could I ask you to say a little more about the specific use case? I saw this line

and the docs’ “Work with data, the Julia way”, which had me thinking this was a replacement for DataFrames. I realize now that the keyword was “persistent”, but I’m still not completely sure where it slots into the data ecosystem. Is it the ability to work with data on several computer cores at once that sets it apart, or is this a type of database? I read the docs (of course) but I am still not completely sure.

Sorry for the naive question, but as a biologist (rather than a programmer) I feel like I am missing something here that is obvious to the other commenters on the thread. My rationale for asking is that I do work a lot with data and I’m thus very interested in the DataFrames/DataTables/IterableTables/IndexedTables debate.

Tamas_Papp · May 9, 2017, 1:25pm

Can it read gzipped files somehow? I have some datasets for which this would be ideal, stored in compressed (gzip -9) format on my HD, but uncompressing them would take 10x the space. But possibly the mmap-based design precludes this.

hpoit · May 9, 2017, 1:34pm

https://github.com/JuliaComputing/JuliaDB.jl/issues/48
I’m working on it right now Tamas.

adriano.vilela · May 9, 2017, 1:52pm

Do you guys know how this compares to Anaconda’s Blaze (https://docs.continuum.io/docs_oss/blaze/)? Are the long-term objectives of the projects similar? I know nothing about distributed data sets, so I apologize if this is a silly question.

kristoffer.carlsson · May 11, 2017, 3:54pm

Can this naming discussion be split out in its own thread please?

mkborregaard · May 11, 2017, 5:09pm

One thing that would be amazing (to me) is a conversation about what exactly this package does, and how it relates to the other core data handling packages in the Julia package ecosystem.

mbauman · May 11, 2017, 5:41pm

Alright, things were fairly intertwined, but I did my best to split off the naming discussion into: The naming of JuliaDB.jl

One portion I was a little sad to see go was:

Let’s try to keep this thread focused on the functionality of JuliaDB.

jeff.bezanson · May 11, 2017, 5:50pm

The goal of this package is to be more end-to-end, providing something at a similar level of abstraction to a traditional database: create, update, and query operations connected to a persistent data store. So for now you could see it as just a stand-alone thing. However, eventually it should become an umbrella package, or perhaps just a set of standard APIs, that connects appropriate data structures, file formats, and operations into a standard workflow.

shashi · May 11, 2017, 5:51pm

It would be great to support reading compressed files. I was thinking of going about this using FileIO.jl or a similar abstraction to somehow read zip and gz files.

mkborregaard · May 11, 2017, 5:55pm

Thanks - this sounds very promising and exciting.

hpoit · May 11, 2017, 6:10pm

Why would this implementation not be straightforward? It sounds like a good plan.

js135005 · May 11, 2017, 6:19pm

It should be fairly straightforward to incorporate the buffered input streams from Libz. Behind the scenes, I am working on an alternative to ZipFile for zip archives that doesn’t have a 20x slowdown compared to reading uncompressed files directly. Unfortunately no guarantee as to when I might have something like that working.
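The buffered-stream idea mentioned above (decompress while reading, rather than unpacking to disk first) can be sketched with Python's standard library. This is only an illustration of the general technique, not Libz or JuliaDB code, and `sum_column_gz` is a hypothetical helper name. Note that streaming like this gives up the random access that a memory-mapped design relies on.

```python
import csv
import gzip
import os
import tempfile


def sum_column_gz(path, column):
    """Stream a gzip-compressed CSV row by row and sum one numeric column."""
    total = 0.0
    # In text mode, gzip.open decompresses incrementally as rows are pulled,
    # so the uncompressed file is never written to disk or held fully in memory.
    with gzip.open(path, "rt", newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.csv.gz")
        with gzip.open(path, "wt", newline="") as f:
            f.write("id,value\n1,2.5\n2,4.0\n3,1.5\n")
        print(sum_column_gz(path, "value"))  # 8.0
```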

mbauman · November 13, 2018, 1:59am

A post was split to a new topic: Package for reading/writing ~100GB data files
