Thanks a lot for your explanation @kmundic! (And sorry for my late response, it’s been a hectic time for me.)
I understand now that my objective is different from the traditional split-apply-combine approach - I want to select subgroups as opposed to applying functions to them.
Some respondents pointed out that my initial problem description is rather vague. My apologies for this.
Here is a more comprehensive and precise exposition:
I have a dataframe with a large number (2.5 mio) of rows and a much smaller number of columns (8). Each row contains information on a particular individual (age, income, etc.). The individuals belong to different families (with sizes ranging from 1 to 12 individuals). Each family has a unique id which is assigned to its members (so the individual-level variables also include id
and a variable which indicates the status
of an individual within the family, e.g. husband, wife, child, etc).
My objective is to select the families (subgroups) which contain specific sets of individuals. The selection conditions can have two forms:
-
they can refer to characteristics of a specific individual within the family (e.g. the age of the individual for which status == 1 (husband))
-
they can refer to characteristics of several individuals within the family (e.g. the income of the individuals for which status == 1 and status == 2 (wife)).
Consider this dataframe (replication code below):
julia> df
11×4 DataFrame
│ Row │ id │ age │ inc │ status │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼────────┤
│ 1 │ 1 │ 53 │ 5 │ 1 │
│ 2 │ 1 │ 52 │ 5 │ 2 │
│ 3 │ 1 │ 17 │ 0 │ 3 │
│ 4 │ 2 │ 30 │ 0 │ 1 │
│ 5 │ 2 │ 29 │ 20 │ 2 │
│ 6 │ 3 │ 22 │ 12 │ 1 │
│ 7 │ 4 │ 61 │ 15 │ 1 │
│ 8 │ 5 │ 55 │ 11 │ 1 │
│ 9 │ 5 │ 51 │ 0 │ 2 │
│ 10 │ 6 │ 67 │ 12 │ 1 │
│ 11 │ 6 │ 62 │ 12 │ 2 │
I want to keep families (indicated by id
) which meet these conditions:
- husband (status == 1) is older than 30 but younger than 65
- contain a husband (status == 1) and a wife (status == 2)
- at least one of husband or wife has income larger than 10
Hence, I want to keep the family with id == 5
so my result dataframe looks like:
julia> df_result
2×4 DataFrame
│ Row │ id │ age │ inc │ status │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼────────┤
│ 1 │ 5 │ 55 │ 11 │ 1 │
│ 2 │ 5 │ 51 │ 0 │ 2 │
Given that I have a 2.5 mio individuals, I need the algorithm which produces df_result
to be efficient. Ideally, it should allow to easily adjust the selection conditions (e.g. to include additional criteria).
@pdeffebach and @xiaodai I hope this explanation is clearer and more comprehensive than my earlier one. Thanks again for helping with this.
Replication of the above dataframes:
using DataFrames
df = DataFrame(id = [1,1,1,2,2,3,4,5,5,6,6], age = [53,52,17,30,29,22,61,55,51,67,62], inc = [5,5,0,0,20,12,15,11,0,12,12]; status = [1,2,3,1,2,1,1,1,2,1,2]);
df_result = DataFrame(id = [5,5], age = [55,51], inc = [11,0]; status = [1,2]);