I’ve been using JuliaDB and absolutely loving it. It helped me restructure my data and I’m now able to process my data in about 40 LOC. It all makes sense.
In my application, I’m joining and grouping sources together to distill a table that contains all the data needed for the final step. This final step is costly (each iteration – each row – takes a few seconds). What I’m missing though, is some sort of “piping”:
I’m executing this costly step on each row with a groupby (so it groups the data and then applies the step). The result is the final product (the groupby returns a table only when it’s done, not per row). But since I have hundreds of rows and each row is slow, stopping the process midway loses all the results computed so far (why stop the process, one might ask; good question). What would be good is some DataFrames.groupby that iterates over the groups and has some side effect (like saving each result or piping it to a sink). As I’m writing this, I figure I could just create a grouped table (that’s of course very fast), and then iterate over its rows in a for-loop, saving the results as I go. Yeah, that’s basically the same.
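Roughly, the save-as-you-go loop could look like the sketch below. It uses only Base Julia and the Serialization standard library rather than JuliaDB itself, so the grouping is done by hand. `costly_step`, `group_rows`, `process_groups`, and the one-file-per-group layout are all made-up names and choices for illustration, not part of any package.

```julia
using Serialization  # standard library, used to save each result to disk

# Hypothetical stand-in for the expensive per-group computation.
costly_step(rows) = sum(r.value for r in rows)

# Group rows (NamedTuples) by `keyfn`, keeping the first-seen order of the keys.
function group_rows(keyfn, rows)
    groups = Dict{Any,Vector{eltype(rows)}}()
    order = Any[]
    for r in rows
        k = keyfn(r)
        if !haskey(groups, k)
            groups[k] = eltype(rows)[]
            push!(order, k)
        end
        push!(groups[k], r)
    end
    return [(k, groups[k]) for k in order]
end

# Run the costly step group by group, writing each result to its own file
# as soon as it is computed. A group whose file already exists is loaded
# instead of recomputed, so an interrupted run can pick up where it stopped.
function process_groups(keyfn, rows, outdir)
    mkpath(outdir)
    results = Dict{Any,Any}()
    for (k, grp) in group_rows(keyfn, rows)
        path = joinpath(outdir, string(k, ".jls"))
        if isfile(path)
            results[k] = deserialize(path)
        else
            results[k] = costly_step(grp)
            serialize(path, results[k])
        end
    end
    return results
end

rows = [(id = 1, value = 2.0), (id = 1, value = 3.0), (id = 2, value = 5.0)]
outdir = mktempdir()
results = process_groups(r -> r.id, rows, outdir)
```

Stopping midway then only costs the group that was in progress, and the final table can be assembled from `results` (or from the saved files) at the end.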
OK, I’ll post this just in case someone has a better suggestion. Sorry for the somewhat vague post.
The @groupby clause iterates Groupings. A Grouping is an AbstractArray that holds the rows that belong to that group, plus you can call key(g) on the group to retrieve the key of that group.
In theory @map should normally be side-effect free, but I guess there is no real harm in adding some diagnostic output like here.
You can probably also do groupby(identity, t, by; select = ...) to get a table where one column has the grouped tables and then work on it. This should be reasonably fast as you are just taking views of the original.
Yes! So I ended up creating a new table with groupby, as I mentioned in the beginning, and haven’t YET tried your LightQuery. As an aside, it’s a bit overwhelming to test all the different query-like APIs out there, so things take time, a lot of time. One thing is for sure, the more packages I try the better my data becomes… Funny.
@bramtayl I’m curious, do you know how well LightQuery works with out-of-core JuliaDB datasets? Per the docs, there’s a much more limited subset of processes that can be run once it’s out-of-core, and I’m not entirely sure where LightQuery sits in all of the Query frameworks out there.
@versipellis I haven’t tried. LightQuery is in a bit of a state of flux right now, as I’m simultaneously adding SQL support to Query and LightQuery. But I suspect the answer will be yes, provided that