Comparing DataFrames native API and Query


#1

I’m trying to understand to what extent the functionality of Query.jl can replace data manipulations that are already implemented in the DataFrames.jl. So far I’m impressed with Query’s flexibility, but there are a couple of things that I can do in DataFrames and didn’t manage to implement in a pure Query way (which I’d like to, so that the implementation would work with any IterableTable):

  • groupby with an arbitrary number of columns. In DataFrames I can do:
julia> using DataFrames, RDatasets

julia> school = RDatasets.dataset("mlmRev","Hsb82");

julia> v = [:Sx, :Minrty]
2-element Array{Symbol,1}:
 :Sx    
 :Minrty

julia> df = by(school, v, dd -> DataFrame(n = size(dd,1)))
4×3 DataFrames.DataFrame
│ Row │ Sx       │ Minrty │ n    │
├─────┼──────────┼────────┼──────┤
│ 1   │ "Male"   │ "No"   │ 2481 │
│ 2   │ "Male"   │ "Yes"  │ 909  │
│ 3   │ "Female" │ "No"   │ 2730 │
│ 4   │ "Female" │ "Yes"  │ 1065 │

to know how many datapoints there are for any combination of :Sx and :Minrty. Of course in the case of two variables I can do it by hand in Query but I’m not sure how to implement an @group where I split by an arbitrary number of variables.

  • reshaping. In DataFrames if I now want to compare number of males vs females, I can
julia> unstack(d,:Sx, :n)
2×3 DataFrames.DataFrame
│ Row │ Minrty │ Male │ Female │
├─────┼────────┼──────┼────────┤
│ 1   │ "No"   │ 2481 │ 2730   │
│ 2   │ "Yes"  │ 909  │ 1065   │

Is there some clean way of implementing these two things in Query that I’m missing? If there isn’t, is it because of lack of time/will be implemented in the future, or there are fundamental reasons why this is easier to implement in DataFrames than it is in Query?

Thanks!


#2

You can group by any arbitrary julia expression, including one that extracts multiple columns. With the current version you could write this query in the following way:

@from i in school begin
    @group i by (i.Sx, i.Minrty) into g
    @select {Sx = g.key[1], Minrt = g.key[2], n = length(g)}
    @collect DataFrame
end

Here you are grouping by a tuple that represents the two columns you want to group by.

In the next version of Query (hopefully to be released at the end of the week), you will be able to use the {} named tuple syntax everywhere in a query, and then you can actually use the {} syntax to specify by what you want to group, making the whole query easier to grog:

@from i in school begin
    @group i by {i.Sx, i.Minrty} into g
    @select {g.key.Sx, g.key.Minrt, n = length(g)}
    @collect DataFrame
end

The main difference here is that the key field of each group will now be a named tuple with two fields, and then you can access them by name rather than by position in the named tuple.

They key thing here is that you can essentially specify some arbitrary function that computes something for each row, and then things will be grouped by the return value of that function. That can be also super handy beyond just grouping by more than one column, for example you can easily group a year column by decades by using a clever equation etc.

There is no story for unstack in Query.jl and my current thinking is that there shouldn’t be, but I still might change my mind about that. Right now all the operations in Query.jl have one common feature: you can infer the structure (e.g. columns) of a complete query if you a) know the structure of the source and b) know the query. In particular, the structure of the result never depends on the values of the elements/rows of the source. That property enables a whole bunch of things that are not implemented but are on the roadmap (easy translation into SQL etc.). unstack does not have that property, so I’m hesitant to put it into Query.jl. The tidyverse also handles it that way: dplyr (the closest thing to Query) does not actually have unstack, instead there is a separate package (tidyr) that has a whole bunch of functions around tidying data, and unstack is one of them. So my gut feeling right now is that there should probably be another companion package to Query.jl that has the functionality that the tidyr package in R has. I hope to work on that at some point, but probably not in the very near future.


#3

Thanks a lot for your thorough reply, it was very instructive!

As far as groupby is concerned, I meant a slightly different thing, namely grouping by an arbitrary array of columns. I could have a vector v of symbols that is found programmatically (maybe it has 100 symbols and corresponds to all the columns with a given property). In DataFrames, I can pass the vector to groupby, as in Query I can only pass a named tuple, I would need a way to convert a vector to a named tuple. However I’m not aware of any method to create a named tuple other than typing all the arguments manually.

About unstack, you make a very valid point, however I’m curious whether there could be intermediate solutions: for example one where the user needs to explicitly write what values of the variable column will be used to generate new columns. In this way it would corresponding to selecting a bunch of subdataframes (one per value for each value in this list, the rest of the data should be discarded) and then joining them (which right now is possible but a bit clumsy to write): unstack would then mostly be syntactic sugar for this. This would still be useful in practice and doesn’t seem to violate your requirement.


#4

LazyQuery currently has two backends, one based on DataFrames and one based on Query. You can do all the things you mentioned (including summarizing and unstacking) with the DataFrames backend.


#5

Ah, yes, that doesn’t work right now, and I don’t have a good idea how to get it to work… I’ll keep thinking about it, though, it would certainly be nice to have.

Hm, also an interesting idea. I guess another alternative would be to just have an unstack that works on an iterable table, but is not lazy…