Nesting and unnesting DataFrames

Dear Julia community,
I wonder if one can reproduce the nest and unnest functionality of TidyR. I have seen the nice related blog Nesting and unnesting columns in DataFrames.jl | Blog by Bogumił Kamiński, which however ‘only’ seems to work over Tuples.

A MWE:

df = DataFrame(rand(100,3), :auto)
df.gr=repeat('A':'D'; inner=25)

## Create a simple nested DataFrame
df_nested = combine(groupby(df, :gr), [:x1, :x2, :x3] => ((x,y,z) ->  Ref(DataFrame(x1=x, x2=y, x3=z))) => :X_DataFrame)

julia> df_nested
4×2 DataFrame
Row │ gr    X_DataFrame    
    │ Char  DataFrame      
─────┼──────────────────────
  1 │ A     25×3 DataFrame 
  2 │ B     25×3 DataFrame 
  3 │ C     25×3 DataFrame 
  4 │ D     25×3 DataFrame 

Questions:

  1. How can I revert df_nested into df? I tried: transfrom(dfc, :X_DataFrame => AsTable) or combine(dfc, :X_DataFrame => AsTable) for instance, but get error keys(::DataFrame) does not exist.
  2. How can I create a more general function, which does the nesting instead of doing it by hand? Something like: nest(gdf:GroupedDataFrame, cols)

Sorry, I still don’t have the best intuition, of the great DataFrames package…

Getting df back is just a matter of vcating the individual DataFrames, the only unwieldy thing is to add back the group indicator:

[DataFrame(gr = reduce(vcat, fill.(df_nested.gr, nrow.(df_nested.X_DataFrame)))) reduce(vcat, df_nested.X_DataFrame)]

here DataFrame(gr = reduce(vcat, fill.(df_nested.gr, nrow.(df_nested.X_DataFrame)))) creates the vector of group identifiers, reduce(vcat, df_nested.X_DataFrame) creates the DataFrame, and then [x y] hcats them together.

As for generalising this, you can do:

function nest(df::DataFrame, nest_cols; other_cols = nothing)
      other_cols = isnothing(other_cols) ? names(df[!, Not(nest_cols)]) : other_cols
      combine(groupby(df, nest_cols),
            other_cols => ((args...) -> Ref(DataFrame(["x$i" => i for i ∈ args]))) => :X_DataFrame)
end 

which works for arbitrary numbers of grouping columns and allows you to select the columns to be included in the nested DataFrame if you don’t want them all (of course then you can’t get back to the original df).

1 Like

Thanks for the hints! Based on those I came up with the two (slightly more general) functions, certainly still not 100% robust and save:

import Chain: @chain as @c
function nest(df::DataFrame, group_cols; nest_cols = nothing)
    nest_cols = isnothing(nest_cols) ? names(df[!, Not(group_cols)]) : nest_cols
    combine(groupby(df, group_cols),
          nest_cols => ((args...) -> Ref(DataFrame([nest_cols[i] => a for (i,a) ∈ enumerate(args)]))) => :nested)
end

function unnest_dfcol(df::DataFrame, col; appendName=true)
    #[DataFrame(gr = reduce(vcat, fill.(df_nested.gr, nrow.(df_nested.X_DataFrame)))) reduce(vcat, df_nested.X_DataFrame)]
    df_gr = select(df, Not(col))
    gr_expand = @c [reduce(vcat, fill(DataFrame(g), nrow(n[col])))   for (g,n)  in zip(eachrow(df_gr), eachrow(df))] reduce(vcat, _)
    unnested = reduce(vcat, df[:,col] )
    appendName && rename!(x-> string(col) * "_" * x, unnested)

    #return hcat(gr_expand, unnested)
    return [gr_expand unnested]
end

At the end everything is possible in Julia quite nicely…

I’d love to see more of such (better programmed I guess) high-level functionality in DataFrames :slight_smile:

P.S.: And I was hoping/thinking that it could be done in a similar way as the nice blog post by @bkamins above!

I would avoid this style in the first place. In your example, I would not have the function I combine over return a DataFrame.

Hhmm, but then let me know how things discussed in 6.3 Nesting | R for data science: tidyverse and beyond would be possible in a similarly neatly structured way…

I did glance at that page but honestly reading R and its output is like reading russian (which I don’t).

groupby returns nicely grouped DataFrames which you can act on just like any other DataFrame. If you want to run models by group, which I think is what that page talks about, then you can loop or map over the grouped DataFrames. This will also be more efficient because these are just views of the parent, whereas in your example I would imagine you are basically creating a second copy of the data which is arranged slightly differently than the first.

2 Likes

We are currently discussing adding nest and unnest in https://github.com/JuliaData/DataFrames.jl/pull/3258 (you can have a look at the implementation there).

The PR is not actively worked on since we are not sure if at all it is needed. Most of the cases are covered by groupby as @tbeason pointed. Can you please comment in the PR if you would find it useful? (we try to avoid “copying” functions from other packages if equivalent functionality is provided in a different way, but if something would we useful we are open to add it).

3 Likes

Thanks - I wasn’t aware of the discussion and PR! The discussion there looks good. Given I am new to Julia it is hard for me to see which functionality would already be there, but I can try to summarize which functionality would be great IMO. And user friendliness always is a good argument I think :-).

Of course it is, but one has to keep in mind that a small API surface with well thought out and powerful abstractions is also user friendly. Special casing all sorts of convenience things and making the package and documentation unwieldy in the process should be (and IMHO for Dataframes.jl has been) avoided.

For new personally the small API of DataFrames compared to pandas or tidyr/dplyr is a major upside (although granted it is also due to the fact that base Julia is fast so DataFrames doesn’t have to implement lots of special cases to deliver performance on common tasks).

1 Like

nest unnest for homogeneous dataframes

grp=groupby(df, :gr)
k=first.(keys(grp))
dfs=[DataFrame(g) for g in grp]
ndf=DataFrame(;k,dfs)
vcat([combine(ndf,:dfs=>(x->x[r])=>AsTable) for r in 1:nrow(ndf)]...)

nest unnest for inhomogeneous dataframes

function expand2(ndf, col)
    df2nt(sdf)=(;zip(Symbol.(string.(col,'.',names(sdf))),eachcol(sdf))...) #cambiando i nomi in questo modo si evita di modificare il df originario
    rexp=findall(er-> er[col] isa DataFrame, eachrow(ndf))
    rt=Tables.dictrowtable(vcat(df2nt.(ndf[rexp,col])...))
    edf=hcat(ndf[rexp,Not(col)],DataFrame(rt))
    iexp=findall.(>(1),map(row-> [c isa Vector ? length(c) : 1 for c in row],values.(eachrow(edf)))) # trova le righe dove ci sono le celle da espandere (ad esempio nomi, salari amici per ogni 'name' e anche hobbies amici per 'marge' )
    edfp=mapreduce(r->flatten(edf[r:r,:],iexp[r]), vcat,1: nrow(edf))
    nedfp=select(filter(er-> !isa(er[col] , DataFrame), ndf),Not(col))
    vcat(nedfp,edfp,cols=:union)
end


df1=DataFrame(grp[1][:,[2,4]])
df2=DataFrame(grp[2][:,[1,2]])
df3=DataFrame(grp[3][:,[2,3]])
df4=DataFrame(grp[4][:,[1,2,4]])
dfs=[df1,df2,df3,df4]
grp=groupby(df, :gr)
k=first.(keys(grp))
df_nest = ndf=DataFrame(;k,dfs)
expand2(df_nest,:dfs)