Applying group selection conditions using groupeddataframes

Here is another solution using the latest and best macro-tools, in one fell swoop

using Pipe, PairAsPipe, DataFramesMeta
df_result = @pipe df |>
    groupby(_, :id) |>
    combine(_,
        @pap(status1and2 = sum(in(1:2), :status)),
        @pap(not_wokring_age = sum(:status .== 1 .& (:age .< 25 .| :age .> 61)))
    ) |>
    @where(_, :status1and2 .== 2, :not_wokring_age .== 0) |>
    @select(_, :id) |>
    innerjoin(_, df; on = :id)
1 Like

Thanks @xiaodai! I like the second solution because the structure is very clear. Unfortunately, I am experiencing unexpected behavior when I use it on the larger dataframe. Since I am not familiar with the syntax, I couldn’t figure out myself how to adjust it.

To illustrate, I just added two more qualifying groups to the earlier dataframe.

df2 = DataFrame(id = [1,1,1,2,2,3,4,5,5,6,6], age = [53,52,17,31,29,22,71,55,51,47,42], status = [1,2,3,1,2,1,1,1,2,1,2]);

using Pipe, PairAsPipe, DataFramesMeta
df_result_x = @pipe df2 |>
    groupby(_, :id) |>
    combine(_,
        @pap(status1and2 = sum(in(1:2), :status)),
        @pap(not_wokring_age = sum(:status .== 1 .& (:age .< 25 .| :age .> 61)))
    ) |>
    @where(_, :status1and2 .== 2, :not_wokring_age .== 0) |>
    @select(_, :id) |>
    innerjoin(_, df2; on = :id)

df_result_x

│ Row │ id    │ age   │ status │
│     │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 53    │ 1      │
│ 2   │ 1     │ 52    │ 2      │
│ 3   │ 1     │ 17    │ 3      │
│ 4   │ 2     │ 31    │ 1      │
│ 5   │ 2     │ 29    │ 2      │

I am wondering why the result does not include the group with id 5 and 6. Can you enlighten me? Thanks for helping with this.

innerjoin will only use the ids present in both the LHS and RHS. You probably want leftjoin.

Thanks! Just tried your suggestion - no effect unfortunately.

To check if the problem is related to the order of the groups I just created a different df in which I move the to-be-excluded groups into the last rows.

df22 = DataFrame(id = [1,1,1,2,2,5,5,6,6,3,4], age = [53,52,17,31,29,55,51,47,42,22,71], status = [1,2,3,1,2,1,2,1,2,1,1]);

The problem remains the same…

Maybe you want rightjoin?

That just reproduces the original df.

this condition doesn’t look. So you want to keep the group if the status 1 row is within 25 and 61 or outside?

Modify this line to the correct logic

Here’s a solution that uses groupby (but not strictly split/apply/combine) which takes ~2/3 of the time of your solution in the toy example. I am also using some optimizations on your logic:

function process1(df)
  # initialize result dataframe
  df_result = copy(df[1:2,:]; copycols=true);

  for k = 1:maximum(df.id)

    df_temp = df[df.id .== k, :]

    # Keep only groups with members status == 1 and status == 2
    N = count(i->(i == 1),df_temp.status) + count(i->(i == 2),df_temp.status)
    if N != 2
        continue
    end

    # Keep only groups with working age status1
    stat1_age = df_temp[df_temp.status .== 1, :age]
    if (stat1_age[1] > 61 || stat1_age[1] < 25)
        continue
    end

    append!(df_result,df_temp)

  end

  delete!(df_result,1:2);
end
function process2(df)
  # initialize result dataframe
  df_result = copy(df[1:2,:]; copycols=true)

  for group in groupby(df, :id)

    # Keep only groups with members status == 1 and status == 2
    if 1 ∈ group.status && 2 ∈ group.status # faster than counting
      stat1_age = group[findfirst(x -> x == 1, group.status), :age] # Only using the first occurrence
      if 25 ≤ stat1_age ≤ 61
        append!(df_result, group)
      end
    end

  end

  delete!(df_result, 1:2)

  return df_result
end
julia> @btime process1($df)
  9.877 μs (129 allocations: 11.33 KiB)
5×3 DataFrame
│ Row │ id    │ age   │ status │
│     │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 53    │ 1      │
│ 2   │ 1     │ 52    │ 2      │
│ 3   │ 1     │ 17    │ 3      │
│ 4   │ 2     │ 31    │ 1      │
│ 5   │ 2     │ 29    │ 2      │

julia> @btime process2($df)
  6.997 μs (101 allocations: 9.05 KiB)
5×3 DataFrame
│ Row │ id    │ age   │ status │
│     │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 53    │ 1      │
│ 2   │ 1     │ 52    │ 2      │
│ 3   │ 1     │ 17    │ 3      │
│ 4   │ 2     │ 31    │ 1      │
│ 5   │ 2     │ 29    │ 2      │

Comparing to xiaodai’s answers:

julia> @btime processx($df)
  68.657 μs (408 allocations: 28.48 KiB)
5×3 DataFrame
│ Row │ id    │ age   │ status │
│     │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 53    │ 1      │
│ 2   │ 1     │ 52    │ 2      │
│ 3   │ 1     │ 17    │ 3      │
│ 4   │ 2     │ 31    │ 1      │
│ 5   │ 2     │ 29    │ 2      │
julia> @btime processx2($df)
  81.531 μs (537 allocations: 37.75 KiB)
5×3 DataFrame
│ Row │ id    │ age   │ status │
│     │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 53    │ 1      │
│ 2   │ 1     │ 52    │ 2      │
│ 3   │ 1     │ 17    │ 3      │
│ 4   │ 2     │ 31    │ 1      │
│ 5   │ 2     │ 29    │ 2      │

I tried other solutions with filter!, but that is much slower.

1 Like

I want to keep the group if the status 1 row is within 25 to 61. (My code implements this condition correctly I think.)

Thanks a lot @kmundnic! process2 is helpful for me because it uses syntax I am familiar with and still improves performance.

Since I am new to groupedddataframes it would be helpful to get a snapshot on the current frontier regarding data selection. Specifically, is there currently an efficient way to

  • use logical indexing on rows of groupedddataframes? (I only found those indexing options.)
  • apply selection functions such as filter on groupedddataframes? (This question seems related.)

Thanks a lot to everyone!

study the code and you can fix it. you have all the tools now.

Thanks a lot for your inputs!

It’s not clear what kind of output you want. All the join logic is the exact same as R’s. Can you give an example of the data frame you would like after the join?

1 Like

Regarding your two indexing questions:

  • You can use logical indexing in groups, for example:
julia> groups = groupby(df, :id) # df from example
julia> group = groups[1]
julia> (group[:,:status] .== 1) .| (group[:,:status] .== 2)
3-element BitArray{1}:
 1
 1
 0

However, for your problem it seems unnecessary because you can check your conditions without creating an indexing array, since you only need to check that the values exist.

  • I tried out filter yesterday and it was slower. An example:
julia> filter(row -> row.status == 1 || row.status == 2, group)
2×3 DataFrame
│ Row │ id    │ age   │ status │
│     │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 53    │ 1      │
│ 2   │ 1     │ 52    │ 2      │

Going back to your original question, I think the problem doesn’t quite fit in the split/apply/combine framework. From what I understand, you would like to:

  1. Group by id
  2. Filter the groups by some conditions (could be done with an embedded function in apply, returning empty DataFrames when appropriate)
  3. Vertically concatenate all results (in which joins don’t really apply).
1 Like

Thanks a lot for your explanation @kmundic! (And sorry for my late response, it’s been a hectic time for me.)

I understand now that my objective is different from the traditional split-apply-combine approach - I want to select subgroups as opposed to applying functions to them.

Some respondents pointed out that my initial problem description is rather vague. My apologies for this.

Here is a more comprehensive and precise exposition:

I have a dataframe with a large number (2.5 mio) of rows and a much smaller number of columns (8). Each row contains information on a particular individual (age, income, etc.). The individuals belong to different families (with sizes ranging from 1 to 12 individuals). Each family has a unique id which is assigned to its members (so the individual-level variables also include id and a variable which indicates the status of an individual within the family, e.g. husband, wife, child, etc).

My objective is to select the families (subgroups) which contain specific sets of individuals. The selection conditions can have two forms:

  1. they can refer to characteristics of a specific individual within the family (e.g. the age of the individual for which status == 1 (husband))

  2. they can refer to characteristics of several individuals within the family (e.g. the income of the individuals for which status == 1 and status == 2 (wife)).

Consider this dataframe (replication code below):

julia> df
11×4 DataFrame
│ Row │ id    │ age   │ inc   │ status │
│     │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼───────┼────────┤
│ 1   │ 1     │ 53    │ 5     │ 1      │
│ 2   │ 1     │ 52    │ 5     │ 2      │
│ 3   │ 1     │ 17    │ 0     │ 3      │
│ 4   │ 2     │ 30    │ 0     │ 1      │
│ 5   │ 2     │ 29    │ 20    │ 2      │
│ 6   │ 3     │ 22    │ 12    │ 1      │
│ 7   │ 4     │ 61    │ 15    │ 1      │
│ 8   │ 5     │ 55    │ 11    │ 1      │
│ 9   │ 5     │ 51    │ 0     │ 2      │
│ 10  │ 6     │ 67    │ 12    │ 1      │
│ 11  │ 6     │ 62    │ 12    │ 2      │

I want to keep families (indicated by id) which meet these conditions:

  1. husband (status == 1) is older than 30 but younger than 65
  2. contain a husband (status == 1) and a wife (status == 2)
  3. at least one of husband or wife has income larger than 10

Hence, I want to keep the family with id == 5 so my result dataframe looks like:

julia> df_result
2×4 DataFrame
│ Row │ id    │ age   │ inc   │ status │
│     │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼───────┼────────┤
│ 1   │ 5     │ 55    │ 11    │ 1      │
│ 2   │ 5     │ 51    │ 0     │ 2      │

Given that I have a 2.5 mio individuals, I need the algorithm which produces df_result to be efficient. Ideally, it should allow to easily adjust the selection conditions (e.g. to include additional criteria).

@pdeffebach and @xiaodai I hope this explanation is clearer and more comprehensive than my earlier one. Thanks again for helping with this.

Replication of the above dataframes:

using DataFrames
df = DataFrame(id = [1,1,1,2,2,3,4,5,5,6,6], age = [53,52,17,30,29,22,61,55,51,67,62], inc = [5,5,0,0,20,12,15,11,0,12,12]; status = [1,2,3,1,2,1,1,1,2,1,2]);
df_result = DataFrame(id = [5,5], age = [55,51], inc = [11,0]; status = [1,2]);

Thanks for the MWE.

This is indeed a tough problem! And one we could improve on in two areas.

  1. We don’t have a filter for a GroupedDataFrame, yet. See an issue here. This is a perfect use-case.
  2. Indexing into individual rows is a bit awkward, so something like “keep when the husband’s income high” is a bit tough because it’s hard to get the husband’s observation.

Nonetheless, here is a solution that gets around those two problems.

  1. We use combine to filter, by returning an empty data frame for groups we don’t want.
  2. I use some slightly awkward indexing and findfirst to get the observations for the husband and wife.

Here is a solution

julia> df_combine_result = combine(groupby(df, :id)) do sdf
        
       
       if (1 in sdf.status && 2 in sdf.status)
           # nothing
       else 
           return DataFrame()
       end
        
       husband = sdf[findfirst(==(1), sdf.status), :] # awkward, sorry
       wife = sdf[findfirst(==(2), sdf.status), :]
       
       if !(husband.age > 30 && husband.age < 65) 
           return DataFrame()
       end
       
       if !(husband.inc > 10 || wife.inc > 10)
           return DataFrame()
       end
       
       return sdf
       end
2×4 DataFrame
│ Row │ id    │ age   │ inc   │ status │
│     │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼───────┼────────┤
│ 1   │ 5     │ 55    │ 11    │ 1      │
│ 2   │ 5     │ 51    │ 0     │ 2      │

3 Likes

Thanks for sharing this solution!

Do you think it’s worth referencing this problem in the DataFrames Development Survey?

We are aware of these issues but its always helpful to formalize feedback!

This is a little bit off-topic since the original question was about performance. But here’s how I would solve the problem using the Douglass.jl interface to DataFrames, which may be appealing to economists and those familiar with Stata:

> df
11×4 DataFrame
│ Row │ id    │ age   │ inc   │ status │
│     │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼───────┼────────┤
│ 1   │ 1     │ 53    │ 5     │ 1      │
│ 2   │ 1     │ 52    │ 5     │ 2      │
│ 3   │ 1     │ 17    │ 0     │ 3      │
│ 4   │ 2     │ 30    │ 0     │ 1      │
│ 5   │ 2     │ 29    │ 20    │ 2      │
│ 6   │ 3     │ 22    │ 12    │ 1      │
│ 7   │ 4     │ 61    │ 15    │ 1      │
│ 8   │ 5     │ 55    │ 11    │ 1      │
│ 9   │ 5     │ 51    │ 0     │ 2      │
│ 10  │ 6     │ 67    │ 12    │ 1      │
│ 11  │ 6     │ 62    │ 12    │ 2      │

Construct the husband’s age and a group-level dummy that captures whether the husband age requirement is satisfied.

Douglass> gen :husband_age = :age if :status == 1
11×5 DataFrame
│ Row │ id    │ age   │ inc   │ status │ husband_age │
│     │ Int64 │ Int64 │ Int64 │ Int64  │ Int64?      │
├─────┼───────┼───────┼───────┼────────┼─────────────┤
│ 1   │ 1     │ 53    │ 5     │ 1      │ 53          │
│ 2   │ 1     │ 52    │ 5     │ 2      │ missing     │
│ 3   │ 1     │ 17    │ 0     │ 3      │ missing     │
│ 4   │ 2     │ 30    │ 0     │ 1      │ 30          │
│ 5   │ 2     │ 29    │ 20    │ 2      │ missing     │
│ 6   │ 3     │ 22    │ 12    │ 1      │ 22          │
│ 7   │ 4     │ 61    │ 15    │ 1      │ 61          │
│ 8   │ 5     │ 55    │ 11    │ 1      │ 55          │
│ 9   │ 5     │ 51    │ 0     │ 2      │ missing     │
│ 10  │ 6     │ 67    │ 12    │ 1      │ 67          │
│ 11  │ 6     │ 62    │ 12    │ 2      │ missing     │
Douglass> bysort :id (:status): egen :husband_age_requirement = (mean(skipmissing(:husband_age)) > 30) & (mean(skipmissing(:husband_age)) < 65)
11×6 DataFrame
│ Row │ id    │ age   │ inc   │ status │ husband_age │ husband_age_requirement │
│     │ Int64 │ Int64 │ Int64 │ Int64  │ Int64?      │ Bool                    │
├─────┼───────┼───────┼───────┼────────┼─────────────┼─────────────────────────┤
│ 1   │ 1     │ 53    │ 5     │ 1      │ 53          │ 1                       │
│ 2   │ 1     │ 52    │ 5     │ 2      │ missing     │ 1                       │
│ 3   │ 1     │ 17    │ 0     │ 3      │ missing     │ 1                       │
│ 4   │ 2     │ 30    │ 0     │ 1      │ 30          │ 0                       │
│ 5   │ 2     │ 29    │ 20    │ 2      │ missing     │ 0                       │
│ 6   │ 3     │ 22    │ 12    │ 1      │ 22          │ 0                       │
│ 7   │ 4     │ 61    │ 15    │ 1      │ 61          │ 1                       │
│ 8   │ 5     │ 55    │ 11    │ 1      │ 55          │ 1                       │
│ 9   │ 5     │ 51    │ 0     │ 2      │ missing     │ 1                       │
│ 10  │ 6     │ 67    │ 12    │ 1      │ 67          │ 0                       │
│ 11  │ 6     │ 62    │ 12    │ 2      │ missing     │ 0                       │

The husband-and-wife condition. Note that we can use any vector-valued Julia function on the right-hand-side of the assignment operation (here, any).

Douglass> bysort :id (:status): egen :husband_and_wife = any(:status .== 1) & any(:status .== 2)
11×7 DataFrame
│ Row │ id    │ age   │ inc   │ status │ husband_age │ husband_age_requirement │ husband_and_wife │
│     │ Int64 │ Int64 │ Int64 │ Int64  │ Int64?      │ Bool                    │ Bool             │
├─────┼───────┼───────┼───────┼────────┼─────────────┼─────────────────────────┼──────────────────┤
│ 1   │ 1     │ 53    │ 5     │ 1      │ 53          │ 1                       │ 1                │
│ 2   │ 1     │ 52    │ 5     │ 2      │ missing     │ 1                       │ 1                │
│ 3   │ 1     │ 17    │ 0     │ 3      │ missing     │ 1                       │ 1                │
│ 4   │ 2     │ 30    │ 0     │ 1      │ 30          │ 0                       │ 1                │
│ 5   │ 2     │ 29    │ 20    │ 2      │ missing     │ 0                       │ 1                │
│ 6   │ 3     │ 22    │ 12    │ 1      │ 22          │ 0                       │ 0                │
│ 7   │ 4     │ 61    │ 15    │ 1      │ 61          │ 1                       │ 0                │
│ 8   │ 5     │ 55    │ 11    │ 1      │ 55          │ 1                       │ 1                │
│ 9   │ 5     │ 51    │ 0     │ 2      │ missing     │ 1                       │ 1                │
│ 10  │ 6     │ 67    │ 12    │ 1      │ 67          │ 0                       │ 1                │
│ 11  │ 6     │ 62    │ 12    │ 2      │ missing     │ 0                       │ 1                │

Finally, the income requirement. This can be done in different ways, e.g.

Douglass> bysort :id (:status): egen :income_requirement = any((:status .== 1) .& (:inc .> 10)) | any((:status .== 2) .& (:inc .> 10))  
11×8 DataFrame
│ Row │ id    │ age   │ inc   │ status │ husband_age │ husband_age_requirement │ husband_and_wife │ income_requirement │
│     │ Int64 │ Int64 │ Int64 │ Int64  │ Int64?      │ Bool                    │ Bool             │ Bool               │
├─────┼───────┼───────┼───────┼────────┼─────────────┼─────────────────────────┼──────────────────┼────────────────────┤
│ 1   │ 1     │ 53    │ 5     │ 1      │ 53          │ 1                       │ 1                │ 0                  │
│ 2   │ 1     │ 52    │ 5     │ 2      │ missing     │ 1                       │ 1                │ 0                  │
│ 3   │ 1     │ 17    │ 0     │ 3      │ missing     │ 1                       │ 1                │ 0                  │
│ 4   │ 2     │ 30    │ 0     │ 1      │ 30          │ 0                       │ 1                │ 1                  │
│ 5   │ 2     │ 29    │ 20    │ 2      │ missing     │ 0                       │ 1                │ 1                  │
│ 6   │ 3     │ 22    │ 12    │ 1      │ 22          │ 0                       │ 0                │ 1                  │
│ 7   │ 4     │ 61    │ 15    │ 1      │ 61          │ 1                       │ 0                │ 1                  │
│ 8   │ 5     │ 55    │ 11    │ 1      │ 55          │ 1                       │ 1                │ 1                  │
│ 9   │ 5     │ 51    │ 0     │ 2      │ missing     │ 1                       │ 1                │ 1                  │
│ 10  │ 6     │ 67    │ 12    │ 1      │ 67          │ 0                       │ 1                │ 1                  │
│ 11  │ 6     │ 62    │ 12    │ 2      │ missing     │ 0                       │ 1                │ 1                  │

Now keep only those observations that satisfy all three requirements.

Douglass> keep if :husband_age_requirement & :husband_and_wife & :income_requirement
2×8 DataFrame
│ Row │ id    │ age   │ inc   │ status │ husband_age │ husband_age_requirement │ husband_and_wife │ income_requirement │
│     │ Int64 │ Int64 │ Int64 │ Int64  │ Int64?      │ Bool                    │ Bool             │ Bool               │
├─────┼───────┼───────┼───────┼────────┼─────────────┼─────────────────────────┼──────────────────┼────────────────────┤
│ 1   │ 5     │ 55    │ 11    │ 1      │ 55          │ 1                       │ 1                │ 1                  │
│ 2   │ 5     │ 51    │ 0     │ 2      │ missing     │ 1                       │ 1                │ 1                  │

I should say that while I’ve tried to implement these functions using DataFrames combine and transform, performance so far has not been my main concern, and I haven’t done a lot of benchmarking yet. Still, I would expect this to not take very long on 2.5m observations.