I am in a similar situation right now. As far as I can tell, DataTables will give pretty good performance, but I’m not really doing much grouping and joining. Most of my work has to do with getting tables into machine-ingestible formats.
Here is my approach so far, though I have no idea how similar our use cases are:
- I implement the DataStreams interface for everything that doesn’t already have it. This provides a bare-minimum interface to tabular formats that can be plugged into larger interfaces. SQL is a huge pain in the ass and a significant obstacle to getting this kind of thing to work nicely. Part of the reason SQL is so terrible is that there are about a billion different implementations of it, all of which behave differently but all of which claim to be SQL. I really wish it would just go away.
- I dump data into Feather files (using Feather.jl). I have a PR which allows me to access individual fields in Feather files. The PR is in limbo right now because @quinnj is overhauling DataStreams, but I expect it’ll be merged once it has been updated to be compatible with whatever the new DataStreams interface looks like.
- I have written a package, Estuaries.jl, which allows me to pull views from anything that implements the DataStreams interface. Estuaries itself has a DataTables-like interface. I haven’t yet put any thought into using it for grouping and joining.
- I’m currently working on a systematic way of getting data into machine-ingestible formats in TheDataMustFlow.jl. It’s very much a work in progress, but at least in principle it allows me to apply machine learning to any dataset which implements the DataStreams interface (even those too large to fit in memory), and I have designed it to be able to run in parallel or on distributed systems (not something I’ve tried yet).
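To make the idea in the last bullet a bit more concrete, here is a minimal sketch of pulling selected columns out of a tabular source in fixed-size batches and turning each batch into a plain `Float64` matrix. It only uses Base Julia and does not use the actual DataStreams, Estuaries.jl, or TheDataMustFlow.jl APIs. The names `ColumnSource`, `nrows`, `fetch_rows`, `batch_matrix`, and `batches` are hypothetical and made up for illustration, and the in-memory `ColumnSource` is just a stand-in for whatever really backs the data (a Feather file, a database, and so on).

```julia
# Hypothetical stand-in for a tabular source. A real source would read from
# disk or a database instead of holding everything in memory.
struct ColumnSource
    names::Vector{Symbol}
    columns::Vector{Vector{Float64}}
end

# Number of rows in the source (0 if it has no columns).
nrows(src::ColumnSource) = isempty(src.columns) ? 0 : length(src.columns[1])

# Fetch only the requested rows of one named column.
# This is the single access primitive the rest of the sketch relies on.
function fetch_rows(src::ColumnSource, name::Symbol, rows::UnitRange{Int})
    idx = findfirst(isequal(name), src.names)
    idx === nothing && throw(ArgumentError("no column named $name"))
    return src.columns[idx][rows]
end

# Assemble a (rows x features) matrix for one batch of rows.
function batch_matrix(src, features::Vector{Symbol}, rows::UnitRange{Int})
    X = Matrix{Float64}(undef, length(rows), length(features))
    for (j, name) in enumerate(features)
        X[:, j] = fetch_rows(src, name, rows)
    end
    return X
end

# Lazily yield one matrix per batch, so only one batch is materialized at a time.
function batches(src, features::Vector{Symbol}, batchsize::Int)
    n = nrows(src)
    return (batch_matrix(src, features, i:min(i + batchsize - 1, n))
            for i in 1:batchsize:n)
end

# Usage: stream two of the three columns in batches of two rows.
src = ColumnSource([:a, :b, :c],
                   [collect(1.0:5.0), collect(10.0:10.0:50.0), zeros(5)])
for X in batches(src, [:a, :b], 2)
    println(size(X))  # each X could be handed to a model here
end
```

Because `batches` returns a generator, nothing is read until a batch is requested, which is the property that matters for data too large to fit in memory. The batches are also independent of each other, so in principle they could be handed to different workers.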
Again, that’s not saying much about grouping and joining. Currently my thinking is that I’ll try to do that kind of thing in whatever DB I’m forced to integrate with. A lot of people around me insist on using Python, so I suspect that I’ll eventually make more use of Pandas.jl for interoperability purposes, at which point I’ll write a DataStreams interface for it as well.