I’m trying to understand to what extent Query.jl can replace the data manipulations that are already implemented in DataFrames.jl. So far I’m impressed with Query’s flexibility, but there are a couple of things that I can do in DataFrames and haven’t managed to implement in a pure Query way (which I’d like to, so that the implementation would work with any IterableTable):
- groupby with an arbitrary number of columns. In DataFrames I can do:
julia> using DataFrames, RDatasets
julia> school = RDatasets.dataset("mlmRev","Hsb82");
julia> v = [:Sx, :Minrty]
2-element Array{Symbol,1}:
:Sx
:Minrty
julia> df = by(school, v, dd -> DataFrame(n = size(dd,1)))
4×3 DataFrames.DataFrame
│ Row │ Sx       │ Minrty │ n    │
├─────┼──────────┼────────┼──────┤
│ 1   │ "Male"   │ "No"   │ 2481 │
│ 2   │ "Male"   │ "Yes"  │ 909  │
│ 3   │ "Female" │ "No"   │ 2730 │
│ 4   │ "Female" │ "Yes"  │ 1065 │
to know how many datapoints there are for any combination of :Sx and :Minrty. Of course, in the case of two variables I can do it by hand in Query, but I’m not sure how to implement an @group where I split by an arbitrary number of variables.
- reshaping. In DataFrames, if I now want to compare the number of males vs females, I can do:
julia> unstack(df, :Sx, :n)
2×3 DataFrames.DataFrame
│ Row │ Minrty │ Male │ Female │
├─────┼────────┼──────┼────────┤
│ 1   │ "No"   │ 2481 │ 2730   │
│ 2   │ "Yes"  │ 909  │ 1065   │
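To pin down the behavior I’m after independently of either package, here is a minimal sketch in plain Julia. It uses neither Query nor DataFrames: `count_by` and `spread_counts` are hypothetical helper names, the data is a made-up toy sample rather than the Hsb82 dataset, and it assumes Julia 1.x named-tuple syntax with rows arriving as an iterator of named tuples.

```julia
# Count rows per combination of the columns listed in `cols`.
# `rows` is any iterator of named tuples (assumption, see above).
function count_by(rows, cols::Vector{Symbol})
    counts = Dict{Tuple,Int}()
    for row in rows
        # Build the grouping key from however many columns were requested.
        key = Tuple(getproperty(row, c) for c in cols)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

# Reshape the counts to a wide layout: for each combination of the remaining
# columns, a Dict from the value of the `spread` column to its count.
function spread_counts(counts, cols::Vector{Symbol}, spread::Symbol)
    pos = findfirst(isequal(spread), cols)
    wide = Dict{Tuple,Dict{Any,Int}}()
    for (key, n) in counts
        rest = Tuple(key[i] for i in eachindex(cols) if i != pos)
        get!(wide, rest, Dict{Any,Int}())[key[pos]] = n
    end
    return wide
end

# Toy rows, only to exercise the two helpers.
rows = [
    (Sx = "Male",   Minrty = "No"),
    (Sx = "Male",   Minrty = "Yes"),
    (Sx = "Female", Minrty = "No"),
    (Sx = "Male",   Minrty = "No"),
]

v = [:Sx, :Minrty]
counts = count_by(rows, v)            # plays the role of by(school, v, ...)
wide = spread_counts(counts, v, :Sx)  # plays the role of unstack(df, :Sx, :n)
```

The point of the sketch is that the grouping key is built at run time from a vector of symbols, which is the part I don’t know how to express inside an @group.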
Is there some clean way of implementing these two things in Query that I’m missing? If there isn’t, is it because of a lack of time (and it will be implemented in the future), or are there fundamental reasons why this is easier to implement in DataFrames than in Query?
Thanks!