Because I believe
pandas, for example, doesn’t come with multi-processing/multi-threading.
Could somebody please clarify what “distributed datasets” means in this context? Coming from Hadoop and distributed databases, I understand it as a set of files stored on multiple machines with the ability to run particular code or a query locally without copying data over a network. However, in the descriptions of both JuliaDB.jl and Dagger.jl, I can see only examples of loading data from a local disk and maybe copying it to other machines for processing.
In other words, is JuliaDB.jl more similar to DataTables or to Hadoop?
It is a fully distributed model; it should be possible to use it with files stored on multiple machines. Also, we don’t load the data and then copy it to a remote machine in two steps. Rather, the remote machine that will handle a certain chunk does the loading itself.
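To make the "worker loads its own chunk" idea concrete, here is a minimal sketch in Python. It is not JuliaDB's API: the function names, the CSV layout, and the use of a thread pool as a stand-in for remote workers are all assumptions made for illustration. The point is only that the coordinator hands out file paths and receives small partial results, so the raw data never passes through it.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def load_and_summarize(path):
    """Runs on the worker: read one chunk and return only a small summary."""
    total, count = 0.0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row["value"])
            count += 1
    return total, count


def distributed_mean(paths, max_workers=4):
    """Hand each worker a path (not the data) and combine the partial results."""
    # Assumption: threads stand in for remote workers in this sketch.
    # Only (total, count) pairs travel back to the coordinator.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(load_and_summarize, paths))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def write_demo_chunks(directory, n_chunks=3, rows_per_chunk=4):
    """Create small hypothetical CSV chunks so the sketch is self-contained."""
    paths = []
    for i in range(n_chunks):
        path = os.path.join(directory, f"chunk_{i}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["value"])
            for j in range(rows_per_chunk):
                writer.writerow([i * rows_per_chunk + j])
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(distributed_mean(write_demo_chunks(d)))
```

In a real cluster the pool would be replaced by processes on the machines that hold the files; the structure of the computation stays the same.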
Are there any examples of such a workflow? So far it sounds like it may become a huge thing for the machine learning / big data community, because existing systems like Apache Spark aren’t really well-suited for scientific computations.
Because they don’t have n-dimensional distributed arrays?
Why so much interest in python’s pandas?
They have distributed matrices if this is what you mean. Not sure about higher-order arrays, though. But there are many other issues such as:
- Java and Scala aren’t really well-suited for data science or machine learning: too verbose, too few libraries, too complicated to extend for most users. Python and R impose a serious performance loss.
- Due to the nature of HDFS, you cannot control data distribution, e.g. you cannot collocate data and use local optimizations.
- Data on HDFS cannot be modified. If you want to modify data, you have to save it to a different location.
- Asynchronous processing is not supported. For example, SGD in Spark is implemented as synchronous iterations over a set of partitions, each being processed by a separate worker. If one of the partitions is significantly larger than the others, or if one worker is significantly slower, all the other workers will wait for it instead of taking the next piece of work (as e.g. Google’s Downpour SGD does), etc.
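The cost of the synchronous model in the last point can be illustrated with a toy calculation. The sketch below is Python and models scheduling only, under simplifying assumptions: task costs are known numbers, and neither Spark nor SGD itself is modelled. The function names are hypothetical. It compares the time workers spend idle at a barrier with the finish time when a free worker simply pulls the next task.

```python
import heapq


def barrier_idle_time(partition_costs):
    """Synchronous step: one worker per partition, everyone waits for the slowest."""
    step_time = max(partition_costs)
    return sum(step_time - cost for cost in partition_costs)


def greedy_makespan(task_costs, n_workers):
    """Asynchronous-style scheduling: a worker that finishes pulls the next task."""
    # Min-heap holding the time at which each worker becomes free.
    free_times = [0.0] * n_workers
    heapq.heapify(free_times)
    for cost in task_costs:
        free_at = heapq.heappop(free_times)  # earliest available worker
        heapq.heappush(free_times, free_at + cost)
    return max(free_times)


if __name__ == "__main__":
    skewed = [1, 1, 1, 5]  # one partition is much larger than the others
    print("idle worker-time at the barrier:", barrier_idle_time(skewed))
    # The same 8 units of work, cut into equal small tasks and pulled on demand.
    print("makespan with small tasks:", greedy_makespan([1] * 8, n_workers=4))
```

With four workers and partitions costing 1, 1, 1 and 5, three workers each sit idle for 4 time units per step; cutting the same work into eight unit tasks lets all four workers finish together.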
If JuliaDB.jl can handle terabyte-sized datasets without imposing Spark’s and HDFS’s limitations, it can become a very attractive framework for a much larger audience than Julia users.
Could I ask you to say a little more about the specific use case? I saw this line
and the docs’ “Work with data, the Julia way”, which had me thinking this was a replacement for DataFrames. I realize now that the keyword was “persistent”, but I’m still not completely sure where it slots into the data ecosystem. Is it the ability to work with data on several computer cores at once that sets it apart, or is this a type of database? I read the docs (of course) but I am still not completely sure.
Sorry for the naive question, but as a biologist (rather than a programmer) I feel like I am missing something here that is obvious to the other commenters on the thread. My rationale for asking is that I do work a lot with data and I’m thus very interested in the DataFrames/DataTables/IterableTables/IndexedTables debate.
Can it read gzipped files somehow? I have some datasets for which this would be ideal, stored in compressed (gzip -9) format on my HD, but uncompressing them would take 10x the space. But possibly the mmap-based design precludes this.
I’m working on it right now Tamas.
Do you guys know how this compares to Anaconda’s Blaze (https://docs.continuum.io/docs_oss/blaze/)? Are the long term objectives of the projects similar? I know nothing about distributed data sets so I apologize if this is a silly question.
Can this naming discussion be split out in its own thread please?
One thing that would be amazing (to me) is a conversation about what exactly this package does, and how it relates to the other core data handling packages in the julia package ecosystem.
Alright, things were fairly intertwined, but I did my best to split off the naming discussion into: The naming of JuliaDB.jl
One portion I was a little sad to see go was:
Let’s try to keep this thread focused on the functionality of JuliaDB.
The goal of this package is to be more end-to-end, providing something at a similar level of abstraction as a traditional database: create, update, and query operations connected to a persistent data store. So for now you could see it as just a stand-alone thing. Eventually, however, it should become an umbrella package, or perhaps just a set of standard APIs, that connect appropriate data structures, file formats, and operations into a standard workflow.
It would be great to support reading compressed files. I was thinking of going about this using FileIO.jl or similar abstraction to somehow read zip and gz files.
Thanks - this sounds very promising and exciting.
Why would this implementation not be straightforward? It sounds like a good plan.
It should be fairly straightforward to incorporate the buffered input streams from Libz. Behind the scenes, I am working on an alternative to ZipFile for zip archives that doesn’t have a 20x slowdown compared to reading uncompressed files directly. Unfortunately no guarantee as to when I might have something like that working.
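For readers unfamiliar with the buffered-stream approach, here is a rough sketch of the idea in Python, using the standard `gzip` and `csv` modules rather than Libz or FileIO.jl. The file layout and function names are made up for illustration. The file is decompressed incrementally as rows are consumed, so nothing is ever written to disk uncompressed.

```python
import csv
import gzip
import os
import tempfile


def write_demo_gzip(path, n_rows=1000):
    """Write a small hypothetical CSV at maximum compression (like gzip -9)."""
    with gzip.open(path, "wt", newline="", compresslevel=9) as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        for i in range(n_rows):
            writer.writerow([i, i * 2])


def sum_column_gz(path, column):
    """Stream rows out of the compressed file without unpacking it on disk."""
    total = 0
    # Text mode wraps the decompressor in a buffered reader, so rows are
    # produced as the stream is consumed rather than all at once.
    with gzip.open(path, "rt", newline="") as f:
        for row in csv.DictReader(f):
            total += int(row[column])
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "demo.csv.gz")
        write_demo_gzip(path)
        print(sum_column_gz(path, "value"))
```

Note that a gzip stream is read sequentially, so this style of access does not offer the random access that a memory-mapped file does, which is relevant to the mmap concern raised earlier in the thread.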