How is the data ecosystem right now for large datasets?

ExpandingMan · June 15, 2017, 7:53pm

I am in a similar situation right now. As far as I can tell, DataTables will give pretty good performance, but I’m not really doing much grouping and joining. Most of my work has to do with getting tables into machine-ingestible formats.

Here is my approach so far, though I have no idea how similar our use cases are:

I implement the DataStreams interface everywhere that doesn’t already have it. This provides a bare-minimum interface to tabular formats that can be plugged into larger interfaces. SQL is a huge pain in the ass and a significant obstacle to getting this kind of thing to work nicely. Part of the reason SQL is so terrible is that there are about a billion different implementations of it, all of which are completely different but all of which claim to be SQL. I really wish it would just go away.
I dump data into Feather files (Feather.jl). I have a PR which allows me to access individual fields in Feather files. The PR is in limbo right now because @quinnj is overhauling DataStreams, but I expect it’ll be merged once it has been updated to be compatible with whatever the new DataStreams interface looks like.
I have written a package, Estuaries.jl, which allows me to pull views from anything that implements the DataStreams interface. Estuaries itself has a DataTables-like interface. I haven’t yet put any thought into using it for grouping and joining.
I’m currently working on a systematic way of getting data into machine-ingestible formats in TheDataMustFlow.jl. It’s very much a work in progress, but at least in principle it allows me to apply machine learning to any dataset which implements the DataStreams interface (even those too large to fit in memory), and I have designed it to be able to run in parallel or on distributed systems (not something I’ve tried yet).
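
To make the general idea concrete, here is a rough, language-agnostic sketch written in Python. It is not the API of DataStreams, Estuaries.jl, or TheDataMustFlow.jl; `ColumnSource`, `read`, and `batches` are hypothetical names made up for illustration. The assumption is only that a source can hand back a range of rows from a single column, which is enough to feed a model one batch at a time without ever holding the whole table in memory.

```python
from typing import Dict, Iterator, List, Sequence


class ColumnSource:
    """Hypothetical stand-in for any tabular source that can return a
    slice of rows from one named column (an assumption, not a real API)."""

    def __init__(self, columns: Dict[str, Sequence[float]]):
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self._columns = columns
        self.nrows = lengths.pop() if lengths else 0

    def read(self, name: str, start: int, stop: int) -> List[float]:
        # A real source would read only this range from disk or a database.
        return list(self._columns[name][start:stop])


def batches(
    source: ColumnSource, names: Sequence[str], batch_size: int
) -> Iterator[List[List[float]]]:
    """Yield row-major batches holding only the requested columns."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, source.nrows, batch_size):
        stop = min(start + batch_size, source.nrows)
        cols = [source.read(name, start, stop) for name in names]
        # Transpose column slices into rows, the layout most ML code expects.
        yield [list(row) for row in zip(*cols)]


if __name__ == "__main__":
    src = ColumnSource({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
    for batch in batches(src, ["a", "b"], batch_size=2):
        print(batch)
```

Because each batch depends only on its own row range, the batches could in principle be produced by separate workers, which is the property that matters for running in parallel.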

Again, that’s not saying much about grouping and joining. My current thinking is that I’ll try to do that kind of thing in whatever database I’m forced to integrate with. A lot of people around me insist on using Python, so I suspect that I’ll eventually make more use of Pandas.jl for interoperability purposes, at which point I’ll write a DataStreams interface for it as well.
