Drop incomplete groups from a DataFrame

After removing some rows from a dataframe, I would like to keep only groups (grouping on a variable) which have “complete” observations (households always have 2 members in this data in the dataset I start with, so for any fewer I can consider that household incomplete).

The following MWE works, is there a more idiomatic way?

julia> using DataFrames

julia> df = DataFrame(household = [100, 100, 101, 101],
       person = [1, 2, 1, 2],
       wage = 1:4)
4×3 DataFrame
│ Row │ household │ person │ wage  │
│     │ Int64     │ Int64  │ Int64 │
├─────┼───────────┼────────┼───────┤
│ 1   │ 100       │ 1      │ 1     │
│ 2   │ 100       │ 2      │ 2     │
│ 3   │ 101       │ 1      │ 3     │
│ 4   │ 101       │ 2      │ 4     │

julia> df = df[df.wage .≥ 2, :]
3×3 DataFrame
│ Row │ household │ person │ wage  │
│     │ Int64     │ Int64  │ Int64 │
├─────┼───────────┼────────┼───────┤
│ 1   │ 100       │ 2      │ 2     │
│ 2   │ 101       │ 1      │ 3     │
│ 3   │ 101       │ 2      │ 4     │

julia> combine(sdf -> size(sdf, 1) == 2 ? sdf : DataFrame(),
       groupby(df, :household))
2×3 DataFrame
│ Row │ household │ person │ wage  │
│     │ Int64     │ Int64  │ Int64 │
├─────┼───────────┼────────┼───────┤
│ 1   │ 101       │ 1      │ 3     │
│ 2   │ 101       │ 2      │ 4     │

That’s what I would have done.

You can iterate through sub data frames in a GroupedDataFrame but it’s less elegant imo.

julia> df = DataFrame(a = [1, 1, 2, 2, 3], b = rand(5));

julia> gd = groupby(df, :a);

julia> to_keep = [nrow(sdf) == 2 for sdf in gd];

julia> DataFrame(gd[to_keep])
4×2 DataFrame
│ Row │ a     │ b         │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 1     │ 0.0124186 │
│ 2   │ 1     │ 0.294827  │
│ 3   │ 2     │ 0.335624  │
│ 4   │ 2     │ 0.0368225 │
1 Like

Not sure if it’s idiomatic, but sometimes I think it’s helpful to add the count as an extra column. You can do so by calling transform! to the grouped data. The parent, ungrouped, dataframe is updated in-place.

julia> df = DataFrame(household = [100, 100, 101, 101],
       person = [1, 2, 1, 2],
       wage = 1:4)
4×3 DataFrame
│ Row │ household │ person │ wage  │
│     │ Int64     │ Int64  │ Int64 │
├─────┼───────────┼────────┼───────┤
│ 1   │ 100       │ 1      │ 1     │
│ 2   │ 100       │ 2      │ 2     │
│ 3   │ 101       │ 1      │ 3     │
│ 4   │ 101       │ 2      │ 4     │

julia> transform!(groupby(df, :household), :household => length)
4×4 DataFrame
│ Row │ household │ person │ wage  │ household_length │
│     │ Int64     │ Int64  │ Int64 │ Int64            │
├─────┼───────────┼────────┼───────┼──────────────────┤
│ 1   │ 100       │ 1      │ 1     │ 2                │
│ 2   │ 100       │ 2      │ 2     │ 2                │
│ 3   │ 101       │ 1      │ 3     │ 2                │
│ 4   │ 101       │ 2      │ 4     │ 2                │
1 Like

Note that nrow is special-cased. So you can do

julia> transform!(groupby(df, :household), nrow)
2 Likes

Thanks for all the answers. “Computing in the table”, which is commonly used in eg Stata, is a style I would particularly like to avoid because I find that it is frequently a source of bugs. I am very happy that DataFrames supports a functional style, and I was just wondering if I am doing the right thing.

There is also filter that will be added in the next release (see https://github.com/JuliaData/DataFrames.jl/pull/2279).

3 Likes