using DataFrames
df = DataFrame(rand(100,3), :auto)
df.gr=repeat('A':'D'; inner=25)
## Create a simple nested DataFrame
df_nested = combine(groupby(df, :gr), [:x1, :x2, :x3] => ((x,y,z) -> Ref(DataFrame(x1=x, x2=y, x3=z))) => :X_DataFrame)
julia> df_nested
4×2 DataFrame
 Row │ gr    X_DataFrame
     │ Char  DataFrame
─────┼──────────────────────
   1 │ A     25×3 DataFrame
   2 │ B     25×3 DataFrame
   3 │ C     25×3 DataFrame
   4 │ D     25×3 DataFrame
Questions:
1. How can I revert df_nested into df? I tried transform(dfc, :X_DataFrame => AsTable) and combine(dfc, :X_DataFrame => AsTable), for instance, but I get an error that keys(::DataFrame) does not exist.
2. How can I create a more general function that does the nesting instead of doing it by hand? Something like nest(gdf::GroupedDataFrame, cols).
Sorry, I still don’t have the best intuition for the great DataFrames package…
Here DataFrame(gr = reduce(vcat, fill.(df_nested.gr, nrow.(df_nested.X_DataFrame)))) creates a one-column DataFrame of group identifiers (each label repeated once per row of its nested DataFrame), reduce(vcat, df_nested.X_DataFrame) stacks the nested DataFrames into a single DataFrame, and then [x y] hcats the two together.
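Put together, the un-nesting described here could look like the sketch below. It rebuilds df and df_nested exactly as in the question; the intermediate names gr and df_flat are only chosen for illustration.

```julia
using DataFrames

# Same setup as in the question
df = DataFrame(rand(100, 3), :auto)
df.gr = repeat('A':'D'; inner = 25)
df_nested = combine(groupby(df, :gr),
    [:x1, :x2, :x3] => ((x, y, z) -> Ref(DataFrame(x1 = x, x2 = y, x3 = z))) => :X_DataFrame)

# One group label per row of each nested DataFrame
gr = reduce(vcat, fill.(df_nested.gr, nrow.(df_nested.X_DataFrame)))

# Stack the nested DataFrames vertically, then hcat them with the labels
df_flat = [DataFrame(gr = gr) reduce(vcat, df_nested.X_DataFrame)]
```

The column order differs from the original (gr comes first), so compare with select(df_flat, names(df)) == df if you want to check the round trip.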
As for generalising this, you can do:
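function nest(df::DataFrame, nest_cols; other_cols = nothing)
    # nest_cols are the grouping columns; by default every other column gets nested
    cols = isnothing(other_cols) ? names(df, Not(nest_cols)) : names(df, other_cols)
    combine(groupby(df, nest_cols),
        # Ref keeps combine from expanding the returned DataFrame into rows;
        # the nested columns keep their original names
        cols => ((args...) -> Ref(DataFrame([n => c for (n, c) in zip(cols, args)]))) => :X_DataFrame)
end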
which works for arbitrary numbers of grouping columns and allows you to select the columns to be included in the nested DataFrame if you don’t want them all (of course then you can’t get back to the original df).
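Going back the other way with several grouping columns works along the same lines as the single-column case. Below is a rough sketch, not anything built into DataFrames: unnest is a hypothetical helper name, and the small example data (g1, g2, x1, x2) is made up purely for the demonstration.

```julia
using DataFrames

# Hypothetical helper: expand the column `col` of nested DataFrames back into rows
function unnest(nested::DataFrame, col)
    key_cols = select(nested, Not(col))
    inner = nested[!, col]
    # Repeat each key row once per row of its nested DataFrame
    idx = reduce(vcat, fill.(1:nrow(nested), nrow.(inner)))
    hcat(key_cols[idx, :], reduce(vcat, inner))
end

# Made-up example data with two grouping columns
df2 = DataFrame(g1 = repeat(1:2; inner = 6),
                g2 = repeat(repeat('a':'c'; inner = 2); outer = 2),
                x1 = rand(12), x2 = rand(12))

# Nest by hand, using the same Ref idiom as in the question
nested = combine(groupby(df2, [:g1, :g2]),
    [:x1, :x2] => ((a, b) -> Ref(DataFrame(x1 = a, x2 = b))) => :X_DataFrame)

restored = unnest(nested, :X_DataFrame)
```

The rows come back grouped by key, so if the original was not already ordered by the grouping columns you would need to sort both sides before comparing them.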
I did glance at that page, but honestly reading R and its output is like reading Russian (which I don’t).
groupby returns nicely grouped DataFrames which you can act on just like any other DataFrame. If you want to run models by group, which I think is what that page talks about, then you can loop or map over the grouped DataFrames. This will also be more efficient because these are just views of the parent, whereas in your example I would imagine you are basically creating a second copy of the data which is arranged slightly differently than the first.
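A minimal sketch of what that could look like, assuming the same df as in the question. The per-group mean of x1 is only a stand-in for whatever model you would actually fit, and the names gdf, means and group_means are chosen just for this example.

```julia
using DataFrames, Statistics

df = DataFrame(rand(100, 3), :auto)
df.gr = repeat('A':'D'; inner = 25)

gdf = groupby(df, :gr)

# Each element of gdf is a SubDataFrame, i.e. a view into df, not a copy
for (key, sdf) in pairs(gdf)
    println(key.gr, ": ", nrow(sdf), " rows, mean(x1) = ", mean(sdf.x1))
end

# map-style version: one result per group
means = [mean(sdf.x1) for sdf in gdf]

# The same summary expressed through combine
group_means = combine(gdf, :x1 => mean => :x1_mean)
```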
The PR is not being actively worked on, since we are not sure whether it is needed at all. Most of the cases are covered by groupby, as @tbeason pointed out. Can you please comment in the PR if you would find it useful? (We try to avoid “copying” functions from other packages if equivalent functionality is provided in a different way, but if something would be useful we are open to adding it.)
Thanks - I wasn’t aware of the discussion and PR! The discussion there looks good. Given I am new to Julia it is hard for me to see which functionality would already be there, but I can try to summarize which functionality would be great IMO. And user friendliness always is a good argument I think :-).
Of course it is, but one has to keep in mind that a small API surface with well-thought-out and powerful abstractions is also user friendly. Special-casing all sorts of convenience things and making the package and documentation unwieldy in the process should be (and IMHO for DataFrames.jl has been) avoided.
For me personally, the small API of DataFrames compared to pandas or tidyr/dplyr is a major upside (although granted, it is also due to the fact that base Julia is fast, so DataFrames doesn’t have to implement lots of special cases to deliver performance on common tasks).