Drop incomplete groups from a DataFrame

Tamas_Papp · July 15, 2020, 3:08pm

After removing some rows from a data frame, I would like to keep only the groups (grouping on a variable) that have “complete” observations. In the dataset I start with, every household has 2 members, so any household with fewer can be considered incomplete.

The following MWE works, but is there a more idiomatic way?

julia> using DataFrames

julia> df = DataFrame(household = [100, 100, 101, 101],
       person = [1, 2, 1, 2],
       wage = 1:4)
4×3 DataFrame
│ Row │ household │ person │ wage  │
│     │ Int64     │ Int64  │ Int64 │
├─────┼───────────┼────────┼───────┤
│ 1   │ 100       │ 1      │ 1     │
│ 2   │ 100       │ 2      │ 2     │
│ 3   │ 101       │ 1      │ 3     │
│ 4   │ 101       │ 2      │ 4     │

julia> df = df[df.wage .≥ 2, :]
3×3 DataFrame
│ Row │ household │ person │ wage  │
│     │ Int64     │ Int64  │ Int64 │
├─────┼───────────┼────────┼───────┤
│ 1   │ 100       │ 2      │ 2     │
│ 2   │ 101       │ 1      │ 3     │
│ 3   │ 101       │ 2      │ 4     │

julia> combine(sdf -> size(sdf, 1) == 2 ? sdf : DataFrame(),
       groupby(df, :household))
2×3 DataFrame
│ Row │ household │ person │ wage  │
│     │ Int64     │ Int64  │ Int64 │
├─────┼───────────┼────────┼───────┤
│ 1   │ 101       │ 1      │ 3     │
│ 2   │ 101       │ 2      │ 4     │

pdeffebach · July 15, 2020, 3:19pm

That’s what I would have done.

You can iterate through sub data frames in a GroupedDataFrame but it’s less elegant imo.

julia> df = DataFrame(a = [1, 1, 2, 2, 3], b = rand(5));

julia> gd = groupby(df, :a);

julia> to_keep = [nrow(sdf) == 2 for sdf in gd];

julia> DataFrame(gd[to_keep])
4×2 DataFrame
│ Row │ a     │ b         │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 1     │ 0.0124186 │
│ 2   │ 1     │ 0.294827  │
│ 3   │ 2     │ 0.335624  │
│ 4   │ 2     │ 0.0368225 │

piever · July 15, 2020, 5:09pm

Not sure if it’s idiomatic, but sometimes I think it’s helpful to add the count as an extra column. You can do so by calling transform! on the grouped data. The parent (ungrouped) data frame is updated in place.

julia> df = DataFrame(household = [100, 100, 101, 101],
       person = [1, 2, 1, 2],
       wage = 1:4)
4×3 DataFrame
│ Row │ household │ person │ wage  │
│     │ Int64     │ Int64  │ Int64 │
├─────┼───────────┼────────┼───────┤
│ 1   │ 100       │ 1      │ 1     │
│ 2   │ 100       │ 2      │ 2     │
│ 3   │ 101       │ 1      │ 3     │
│ 4   │ 101       │ 2      │ 4     │

julia> transform!(groupby(df, :household), :household => length)
4×4 DataFrame
│ Row │ household │ person │ wage  │ household_length │
│     │ Int64     │ Int64  │ Int64 │ Int64            │
├─────┼───────────┼────────┼───────┼──────────────────┤
│ 1   │ 100       │ 1      │ 1     │ 2                │
│ 2   │ 100       │ 2      │ 2     │ 2                │
│ 3   │ 101       │ 1      │ 3     │ 2                │
│ 4   │ 101       │ 2      │ 4     │ 2                │

pdeffebach · July 15, 2020, 5:26pm

Note that nrow is special-cased. So you can do

julia> transform!(groupby(df, :household), nrow)
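
For readers who want the count column without mutating the original data frame, here is a minimal sketch. It assumes a DataFrames.jl version that supports `nrow => :name` pairs and `filter` with a `column => function` pair; the column name `n_members` is made up for illustration and does not appear in the thread.

```julia
using DataFrames

# The data frame from the original question, after the wage filter
df = DataFrame(household = [100, 101, 101],
               person = [2, 1, 2],
               wage = 2:4)

# Non-mutating transform: returns a new data frame with the group size
# stored in a (hypothetical) n_members column; df itself is unchanged
counted = transform(groupby(df, :household), nrow => :n_members)

# Keep rows from households with exactly 2 members, then drop the helper column
complete = select(filter(:n_members => ==(2), counted), Not(:n_members))
```

The helper column exists only in the intermediate `counted`, so nothing is left behind in `df`.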

Tamas_Papp · July 16, 2020, 8:40am

Thanks for all the answers. “Computing in the table”, which is commonly used in e.g. Stata, is a style I would particularly like to avoid because I find that it is frequently a source of bugs. I am very happy that DataFrames supports a functional style, and I was just wondering if I am doing the right thing.

bkamins · July 16, 2020, 8:48am

There is also filter, which will be added in the next release (see https://github.com/JuliaData/DataFrames.jl/pull/2279).
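
A minimal sketch of how this could look, assuming a DataFrames.jl release in which `filter` accepts a `GroupedDataFrame` and passes each sub data frame to the predicate:

```julia
using DataFrames

# The data frame from the original question, after the wage filter
df = DataFrame(household = [100, 101, 101],
               person = [2, 1, 2],
               wage = 2:4)

gd = groupby(df, :household)

# Keep only the groups with exactly 2 rows; the result is still grouped
complete_groups = filter(sdf -> nrow(sdf) == 2, gd)

# Collect the remaining groups back into an ordinary data frame
complete = DataFrame(complete_groups)
```

This stays in the functional style: no helper column is added and `df` is left untouched.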
