Applying group selection conditions using groupeddataframes

I need to select groups of observations from a large dataframe (about 2.9 mio rows) using a number of conditions which apply to different observations in each group (so I cannot select on individual rows only).

Using a small dataframe as a starting point, I wrote an algorithm which applies the conditions and generates the result dataframe with my desired groups within a loop. See mwe below.

If I apply this loop to the large dataframe, performance becomes a (serious) problem.

I haven’t used the split/apply/combine approach before so I am learning about it right now. I worked through the documentation but I haven’t been able to write code for my problem yet.

For example, I am struggling to understand how to select different rows of groupeddataframes. I figured out how to get the age of the status1 row select(combine(first, gdf), :status => :obs1_status) but not for the status2 row.

Any hints/guidance on how to implement selection conditions on groupeddataframes using combine/select/transform commands? (I.e. generate df_result in a faster way?)

Thanks a lot!

using DataFrames

# generate sample dataframe
df = DataFrame(id = [1,1,1,2,2,3,4], age = [53,52,17,31,29,22,71], status = [1,2,3,1,2,1,1])

# initialize result dataframe
df_result = copy(df[1:2,:]; copycols=true);

for k = 1:maximum(df.id)

  df_temp = df[df.id .== k, :]

  # Keep only groups with members status == 1 and status == 2
  N = count(i->(i == 1),df_temp.status) + count(i->(i == 2),df_temp.status)
  if N != 2
      continue
  end

  # Keep only groups with working age status1
  stat1_age = df_temp[df_temp.status .== 1, :age]    
  if (stat1_age[1] > 61 || stat1_age[1] < 25)
      continue
  end

  append!(df_result,df_temp)

end

delete!(df_result,1:2);
using DataFrames

# generate sample dataframe
df = DataFrame(id = [1,1,1,2,2,3,4], age = [53,52,17,31,29,22,71], status = [1,2,3,1,2,1,1])


using Pipe

keep_ids_for_members12 = @pipe df |>
    groupby(_, :id) |>
    combine(_,
        :status => (status->sum(in(1:2), status)) =>:n_status12,
    ) |>
    filter(:n_status12 => ==(2), _) |>
    getindex(_, :id)

keep_ids = @pipe df |>
    filter(:status => ==(1), _) |>
    transform(_, :age => (age -> age .< 25 .| age .> 61) => :not_working_age) |>
    filter(:not_working_age => ==(0), _) |>
    getindex(_, :id)


keep_ids = intersect(keep_ids, keep_ids_for_members12)

df_result = @pipe df |>
    filter(:id => in(keep_ids), _)

Here is another solution using the latest and best macro-tools, in one fell swoop

using Pipe, PairAsPipe, DataFramesMeta
df_result = @pipe df |>
    groupby(_, :id) |>
    combine(_,
        @pap(status1and2 = sum(in(1:2), :status)),
        @pap(not_wokring_age = sum(:status .== 1 .& (:age .< 25 .| :age .> 61)))
    ) |>
    @where(_, :status1and2 .== 2, :not_wokring_age .== 0) |>
    @select(_, :id) |>
    innerjoin(_, df; on = :id)
1 Like

Thanks @xiaodai! I like the second solution because the structure is very clear. Unfortunately, I am experiencing unexpected behavior when I use it on the larger dataframe. Since I am not familiar with the syntax, I couldn’t figure out myself how to adjust it.

To illustrate, I just added two more qualifying groups to the earlier dataframe.

df2 = DataFrame(id = [1,1,1,2,2,3,4,5,5,6,6], age = [53,52,17,31,29,22,71,55,51,47,42], status = [1,2,3,1,2,1,1,1,2,1,2]);

using Pipe, PairAsPipe, DataFramesMeta
df_result_x = @pipe df2 |>
    groupby(_, :id) |>
    combine(_,
        @pap(status1and2 = sum(in(1:2), :status)),
        @pap(not_wokring_age = sum(:status .== 1 .& (:age .< 25 .| :age .> 61)))
    ) |>
    @where(_, :status1and2 .== 2, :not_wokring_age .== 0) |>
    @select(_, :id) |>
    innerjoin(_, df2; on = :id)

df_result_x

β”‚ Row β”‚ id    β”‚ age   β”‚ status β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1     β”‚ 53    β”‚ 1      β”‚
β”‚ 2   β”‚ 1     β”‚ 52    β”‚ 2      β”‚
β”‚ 3   β”‚ 1     β”‚ 17    β”‚ 3      β”‚
β”‚ 4   β”‚ 2     β”‚ 31    β”‚ 1      β”‚
β”‚ 5   β”‚ 2     β”‚ 29    β”‚ 2      β”‚

I am wondering why the result does not include the group with id 5 and 6. Can you enlighten me? Thanks for helping with this.

innerjoin will only use the ids present in both the LHS and RHS. You probably want leftjoin.

Thanks! Just tried your suggestion - no effect unfortunately.

To check if the problem is related to the order of the groups I just created a different df in which I move the to-be-excluded groups into the last rows.

df22 = DataFrame(id = [1,1,1,2,2,5,5,6,6,3,4], age = [53,52,17,31,29,55,51,47,42,22,71], status = [1,2,3,1,2,1,2,1,2,1,1]);

The problem remains the same…

Maybe you want rightjoin?

That just reproduces the original df.

this condition doesn’t look. So you want to keep the group if the status 1 row is within 25 and 61 or outside?

Modify this line to the correct logic

Here’s a solution that uses groupby (but not strictly split/apply/combine) which takes ~2/3 of the time of your solution in the toy example. I am also using some optimizations on your logic:

function process1(df)
  # initialize result dataframe
  df_result = copy(df[1:2,:]; copycols=true);

  for k = 1:maximum(df.id)

    df_temp = df[df.id .== k, :]

    # Keep only groups with members status == 1 and status == 2
    N = count(i->(i == 1),df_temp.status) + count(i->(i == 2),df_temp.status)
    if N != 2
        continue
    end

    # Keep only groups with working age status1
    stat1_age = df_temp[df_temp.status .== 1, :age]
    if (stat1_age[1] > 61 || stat1_age[1] < 25)
        continue
    end

    append!(df_result,df_temp)

  end

  delete!(df_result,1:2);
end
function process2(df)
  # initialize result dataframe
  df_result = copy(df[1:2,:]; copycols=true)

  for group in groupby(df, :id)

    # Keep only groups with members status == 1 and status == 2
    if 1 ∈ group.status && 2 ∈ group.status # faster than counting
      stat1_age = group[findfirst(x -> x == 1, group.status), :age] # Only using the first occurrence
      if 25 ≀ stat1_age ≀ 61
        append!(df_result, group)
      end
    end

  end

  delete!(df_result, 1:2)

  return df_result
end
julia> @btime process1($df)
  9.877 ΞΌs (129 allocations: 11.33 KiB)
5Γ—3 DataFrame
β”‚ Row β”‚ id    β”‚ age   β”‚ status β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1     β”‚ 53    β”‚ 1      β”‚
β”‚ 2   β”‚ 1     β”‚ 52    β”‚ 2      β”‚
β”‚ 3   β”‚ 1     β”‚ 17    β”‚ 3      β”‚
β”‚ 4   β”‚ 2     β”‚ 31    β”‚ 1      β”‚
β”‚ 5   β”‚ 2     β”‚ 29    β”‚ 2      β”‚

julia> @btime process2($df)
  6.997 ΞΌs (101 allocations: 9.05 KiB)
5Γ—3 DataFrame
β”‚ Row β”‚ id    β”‚ age   β”‚ status β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1     β”‚ 53    β”‚ 1      β”‚
β”‚ 2   β”‚ 1     β”‚ 52    β”‚ 2      β”‚
β”‚ 3   β”‚ 1     β”‚ 17    β”‚ 3      β”‚
β”‚ 4   β”‚ 2     β”‚ 31    β”‚ 1      β”‚
β”‚ 5   β”‚ 2     β”‚ 29    β”‚ 2      β”‚

Comparing to xiaodai’s answers:

julia> @btime processx($df)
  68.657 ΞΌs (408 allocations: 28.48 KiB)
5Γ—3 DataFrame
β”‚ Row β”‚ id    β”‚ age   β”‚ status β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1     β”‚ 53    β”‚ 1      β”‚
β”‚ 2   β”‚ 1     β”‚ 52    β”‚ 2      β”‚
β”‚ 3   β”‚ 1     β”‚ 17    β”‚ 3      β”‚
β”‚ 4   β”‚ 2     β”‚ 31    β”‚ 1      β”‚
β”‚ 5   β”‚ 2     β”‚ 29    β”‚ 2      β”‚
julia> @btime processx2($df)
  81.531 ΞΌs (537 allocations: 37.75 KiB)
5Γ—3 DataFrame
β”‚ Row β”‚ id    β”‚ age   β”‚ status β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1     β”‚ 53    β”‚ 1      β”‚
β”‚ 2   β”‚ 1     β”‚ 52    β”‚ 2      β”‚
β”‚ 3   β”‚ 1     β”‚ 17    β”‚ 3      β”‚
β”‚ 4   β”‚ 2     β”‚ 31    β”‚ 1      β”‚
β”‚ 5   β”‚ 2     β”‚ 29    β”‚ 2      β”‚

I tried other solutions with filter!, but that is much slower.

1 Like

I want to keep the group if the status 1 row is within 25 to 61. (My code implements this condition correctly I think.)

Thanks a lot @kmundnic! process2 is helpful for me because it uses syntax I am familiar with and still improves performance.

Since I am new to groupedddataframes it would be helpful to get a snapshot on the current frontier regarding data selection. Specifically, is there currently an efficient way to

  • use logical indexing on rows of groupedddataframes? (I only found those indexing options.)
  • apply selection functions such as filter on groupedddataframes? (This question seems related.)

Thanks a lot to everyone!

study the code and you can fix it. you have all the tools now.

Thanks a lot for your inputs!

It’s not clear what kind of output you want. All the join logic is the exact same as R’s. Can you give an example of the data frame you would like after the join?

1 Like

Regarding your two indexing questions:

  • You can use logical indexing in groups, for example:
julia> groups = groupby(df, :id) # df from example
julia> group = groups[1]
julia> (group[:,:status] .== 1) .| (group[:,:status] .== 2)
3-element BitArray{1}:
 1
 1
 0

However, for your problem it seems unnecessary because you can check your conditions without creating an indexing array, since you only need to check that the values exist.

  • I tried out filter yesterday and it was slower. An example:
julia> filter(row -> row.status == 1 || row.status == 2, group)
2Γ—3 DataFrame
β”‚ Row β”‚ id    β”‚ age   β”‚ status β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1     β”‚ 53    β”‚ 1      β”‚
β”‚ 2   β”‚ 1     β”‚ 52    β”‚ 2      β”‚

Going back to your original question, I think the problem doesn’t quite fit in the split/apply/combine framework. From what I understand, you would like to:

  1. Group by id
  2. Filter the groups by some conditions (could be done with an embedded function in apply, returning empty DataFrames when appropriate)
  3. Vertically concatenate all results (in which joins don’t really apply).
1 Like

Thanks a lot for your explanation @kmundic! (And sorry for my late response, it’s been a hectic time for me.)

I understand now that my objective is different from the traditional split-apply-combine approach - I want to select subgroups as opposed to applying functions to them.

Some respondents pointed out that my initial problem description is rather vague. My apologies for this.

Here is a more comprehensive and precise exposition:

I have a dataframe with a large number (2.5 mio) of rows and a much smaller number of columns (8). Each row contains information on a particular individual (age, income, etc.). The individuals belong to different families (with sizes ranging from 1 to 12 individuals). Each family has a unique id which is assigned to its members (so the individual-level variables also include id and a variable which indicates the status of an individual within the family, e.g. husband, wife, child, etc).

My objective is to select the families (subgroups) which contain specific sets of individuals. The selection conditions can have two forms:

  1. they can refer to characteristics of a specific individual within the family (e.g. the age of the individual for which status == 1 (husband))

  2. they can refer to characteristics of several individuals within the family (e.g. the income of the individuals for which status == 1 and status == 2 (wife)).

Consider this dataframe (replication code below):

julia> df
11Γ—4 DataFrame
β”‚ Row β”‚ id    β”‚ age   β”‚ inc   β”‚ status β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64 β”‚ Int64  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1     β”‚ 53    β”‚ 5     β”‚ 1      β”‚
β”‚ 2   β”‚ 1     β”‚ 52    β”‚ 5     β”‚ 2      β”‚
β”‚ 3   β”‚ 1     β”‚ 17    β”‚ 0     β”‚ 3      β”‚
β”‚ 4   β”‚ 2     β”‚ 30    β”‚ 0     β”‚ 1      β”‚
β”‚ 5   β”‚ 2     β”‚ 29    β”‚ 20    β”‚ 2      β”‚
β”‚ 6   β”‚ 3     β”‚ 22    β”‚ 12    β”‚ 1      β”‚
β”‚ 7   β”‚ 4     β”‚ 61    β”‚ 15    β”‚ 1      β”‚
β”‚ 8   β”‚ 5     β”‚ 55    β”‚ 11    β”‚ 1      β”‚
β”‚ 9   β”‚ 5     β”‚ 51    β”‚ 0     β”‚ 2      β”‚
β”‚ 10  β”‚ 6     β”‚ 67    β”‚ 12    β”‚ 1      β”‚
β”‚ 11  β”‚ 6     β”‚ 62    β”‚ 12    β”‚ 2      β”‚

I want to keep families (indicated by id) which meet these conditions:

  1. husband (status == 1) is older than 30 but younger than 65
  2. contain a husband (status == 1) and a wife (status == 2)
  3. at least one of husband or wife has income larger than 10

Hence, I want to keep the family with id == 5 so my result dataframe looks like:

julia> df_result
2Γ—4 DataFrame
β”‚ Row β”‚ id    β”‚ age   β”‚ inc   β”‚ status β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64 β”‚ Int64  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 5     β”‚ 55    β”‚ 11    β”‚ 1      β”‚
β”‚ 2   β”‚ 5     β”‚ 51    β”‚ 0     β”‚ 2      β”‚

Given that I have a 2.5 mio individuals, I need the algorithm which produces df_result to be efficient. Ideally, it should allow to easily adjust the selection conditions (e.g. to include additional criteria).

@pdeffebach and @xiaodai I hope this explanation is clearer and more comprehensive than my earlier one. Thanks again for helping with this.

Replication of the above dataframes:

using DataFrames
df = DataFrame(id = [1,1,1,2,2,3,4,5,5,6,6], age = [53,52,17,30,29,22,61,55,51,67,62], inc = [5,5,0,0,20,12,15,11,0,12,12]; status = [1,2,3,1,2,1,1,1,2,1,2]);
df_result = DataFrame(id = [5,5], age = [55,51], inc = [11,0]; status = [1,2]);

Thanks for the MWE.

This is indeed a tough problem! And one we could improve on in two areas.

  1. We don’t have a filter for a GroupedDataFrame, yet. See an issue here. This is a perfect use-case.
  2. Indexing into individual rows is a bit awkward, so something like β€œkeep when the husband’s income high” is a bit tough because it’s hard to get the husband’s observation.

Nonetheless, here is a solution that gets around those two problems.

  1. We use combine to filter, by returning an empty data frame for groups we don’t want.
  2. I use some slightly awkward indexing and findfirst to get the observations for the husband and wife.

Here is a solution

julia> df_combine_result = combine(groupby(df, :id)) do sdf
        
       
       if (1 in sdf.status && 2 in sdf.status)
           # nothing
       else 
           return DataFrame()
       end
        
       husband = sdf[findfirst(==(1), sdf.status), :] # awkward, sorry
       wife = sdf[findfirst(==(2), sdf.status), :]
       
       if !(husband.age > 30 && husband.age < 65) 
           return DataFrame()
       end
       
       if !(husband.inc > 10 || wife.inc > 10)
           return DataFrame()
       end
       
       return sdf
       end
2Γ—4 DataFrame
β”‚ Row β”‚ id    β”‚ age   β”‚ inc   β”‚ status β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64 β”‚ Int64  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 5     β”‚ 55    β”‚ 11    β”‚ 1      β”‚
β”‚ 2   β”‚ 5     β”‚ 51    β”‚ 0     β”‚ 2      β”‚

3 Likes

Thanks for sharing this solution!

Do you think it’s worth referencing this problem in the DataFrames Development Survey?