We are aware of these issues, but it's always helpful to formalize feedback!
This is a little bit off-topic since the original question was about performance. But here's how I would solve the problem using the Douglass.jl interface to DataFrames, which may be appealing to economists and those familiar with Stata:
> df
11×4 DataFrame
│ Row │ id    │ age   │ inc   │ status │
│     │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼───────┼────────┤
│ 1   │ 1     │ 53    │ 5     │ 1      │
│ 2   │ 1     │ 52    │ 5     │ 2      │
│ 3   │ 1     │ 17    │ 0     │ 3      │
│ 4   │ 2     │ 30    │ 0     │ 1      │
│ 5   │ 2     │ 29    │ 20    │ 2      │
│ 6   │ 3     │ 22    │ 12    │ 1      │
│ 7   │ 4     │ 61    │ 15    │ 1      │
│ 8   │ 5     │ 55    │ 11    │ 1      │
│ 9   │ 5     │ 51    │ 0     │ 2      │
│ 10  │ 6     │ 67    │ 12    │ 1      │
│ 11  │ 6     │ 62    │ 12    │ 2      │
Construct the husband's age and a group-level dummy that captures whether the husband age requirement is satisfied.
Douglass> gen :husband_age = :age if :status == 1
11×5 DataFrame
│ Row │ id    │ age   │ inc   │ status │ husband_age │
│     │ Int64 │ Int64 │ Int64 │ Int64  │ Int64?      │
├─────┼───────┼───────┼───────┼────────┼─────────────┤
│ 1   │ 1     │ 53    │ 5     │ 1      │ 53          │
│ 2   │ 1     │ 52    │ 5     │ 2      │ missing     │
│ 3   │ 1     │ 17    │ 0     │ 3      │ missing     │
│ 4   │ 2     │ 30    │ 0     │ 1      │ 30          │
│ 5   │ 2     │ 29    │ 20    │ 2      │ missing     │
│ 6   │ 3     │ 22    │ 12    │ 1      │ 22          │
│ 7   │ 4     │ 61    │ 15    │ 1      │ 61          │
│ 8   │ 5     │ 55    │ 11    │ 1      │ 55          │
│ 9   │ 5     │ 51    │ 0     │ 2      │ missing     │
│ 10  │ 6     │ 67    │ 12    │ 1      │ 67          │
│ 11  │ 6     │ 62    │ 12    │ 2      │ missing     │
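For readers who don't know Stata's `gen ... if`, here is a minimal plain-Julia sketch of what that command computes. It uses bare vectors rather than Douglass.jl or DataFrames, so it illustrates the semantics only and is not how the package implements it: rows that fail the condition receive `missing`.

```julia
# Same columns as in the example data above, as plain vectors.
age    = [53, 52, 17, 30, 29, 22, 61, 55, 51, 67, 62]
status = [1, 2, 3, 1, 2, 1, 1, 1, 2, 1, 2]

# Keep the age where status == 1 (the husband); every other row becomes `missing`.
husband_age = ifelse.(status .== 1, age, missing)
```

The element type widens to allow `missing`, which is why the new column is displayed as `Int64?`.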
Douglass> bysort :id (:status): egen :husband_age_requirement = (mean(skipmissing(:husband_age)) > 30) & (mean(skipmissing(:husband_age)) < 65)
11×6 DataFrame
│ Row │ id    │ age   │ inc   │ status │ husband_age │ husband_age_requirement │
│     │ Int64 │ Int64 │ Int64 │ Int64  │ Int64?      │ Bool                    │
├─────┼───────┼───────┼───────┼────────┼─────────────┼─────────────────────────┤
│ 1   │ 1     │ 53    │ 5     │ 1      │ 53          │ 1                       │
│ 2   │ 1     │ 52    │ 5     │ 2      │ missing     │ 1                       │
│ 3   │ 1     │ 17    │ 0     │ 3      │ missing     │ 1                       │
│ 4   │ 2     │ 30    │ 0     │ 1      │ 30          │ 0                       │
│ 5   │ 2     │ 29    │ 20    │ 2      │ missing     │ 0                       │
│ 6   │ 3     │ 22    │ 12    │ 1      │ 22          │ 0                       │
│ 7   │ 4     │ 61    │ 15    │ 1      │ 61          │ 1                       │
│ 8   │ 5     │ 55    │ 11    │ 1      │ 55          │ 1                       │
│ 9   │ 5     │ 51    │ 0     │ 2      │ missing     │ 1                       │
│ 10  │ 6     │ 67    │ 12    │ 1      │ 67          │ 0                       │
│ 11  │ 6     │ 62    │ 12    │ 2      │ missing     │ 0                       │
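The `bysort ... egen` step computes one value per `id` and repeats it on every row of that group. A rough plain-Julia equivalent, assuming bare vectors and an explicit loop over groups (this is only an illustration of the semantics, not Douglass.jl's implementation), could look like this:

```julia
using Statistics  # standard library; provides `mean`

id          = [1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6]
husband_age = [53, missing, missing, 30, missing, 22, 61, 55, missing, 67, missing]

# One Bool per row, filled group by group.
husband_age_requirement = similar(id, Bool)
for g in unique(id)
    rows = findall(==(g), id)
    # Each household here has exactly one non-missing husband age,
    # so the mean over the non-missing values is that age.
    m = mean(skipmissing(husband_age[rows]))
    husband_age_requirement[rows] .= (30 < m < 65)
end
```

Both bounds are strict, which is why household 2 (husband aged exactly 30) fails the requirement.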
The husband-and-wife condition. Note that we can use any Julia function that operates on vectors on the right-hand side of the assignment (here, `any`).
Douglass> bysort :id (:status): egen :husband_and_wife = any(:status .== 1) & any(:status .== 2)
11×7 DataFrame
│ Row │ id    │ age   │ inc   │ status │ husband_age │ husband_age_requirement │ husband_and_wife │
│     │ Int64 │ Int64 │ Int64 │ Int64  │ Int64?      │ Bool                    │ Bool             │
├─────┼───────┼───────┼───────┼────────┼─────────────┼─────────────────────────┼──────────────────┤
│ 1   │ 1     │ 53    │ 5     │ 1      │ 53          │ 1                       │ 1                │
│ 2   │ 1     │ 52    │ 5     │ 2      │ missing     │ 1                       │ 1                │
│ 3   │ 1     │ 17    │ 0     │ 3      │ missing     │ 1                       │ 1                │
│ 4   │ 2     │ 30    │ 0     │ 1      │ 30          │ 0                       │ 1                │
│ 5   │ 2     │ 29    │ 20    │ 2      │ missing     │ 0                       │ 1                │
│ 6   │ 3     │ 22    │ 12    │ 1      │ 22          │ 0                       │ 0                │
│ 7   │ 4     │ 61    │ 15    │ 1      │ 61          │ 1                       │ 0                │
│ 8   │ 5     │ 55    │ 11    │ 1      │ 55          │ 1                       │ 1                │
│ 9   │ 5     │ 51    │ 0     │ 2      │ missing     │ 1                       │ 1                │
│ 10  │ 6     │ 67    │ 12    │ 1      │ 67          │ 0                       │ 1                │
│ 11  │ 6     │ 62    │ 12    │ 2      │ missing     │ 0                       │ 1                │
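To illustrate the point about arbitrary functions of a vector, here is a small plain-Julia sketch. The helper `group_broadcast` is hypothetical (it is not part of Douglass.jl or DataFrames): it applies a function to each group's slice and repeats the scalar result on every row of the group.

```julia
id     = [1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6]
status = [1, 2, 3, 1, 2, 1, 1, 1, 2, 1, 2]

# Hypothetical helper: for each row, evaluate `f` on the slice of `x`
# that belongs to the same group as that row. Quadratic, but fine for a demo.
group_broadcast(f, id, x) = [f(x[id .== g]) for g in id]

# The husband-and-wife condition: the group has a status-1 and a status-2 member.
husband_and_wife = group_broadcast(s -> any(s .== 1) && any(s .== 2), id, status)

# Any other function of a vector works the same way, e.g. the household size.
household_size = group_broadcast(length, id, status)
```

Households 3 and 4 contain only a husband, so they are the only ones that fail the condition.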
Finally, the income requirement. This can be done in different ways, e.g.
Douglass> bysort :id (:status): egen :income_requirement = any((:status .== 1) .& (:inc .> 10)) | any((:status .== 2) .& (:inc .> 10))
11×8 DataFrame
│ Row │ id    │ age   │ inc   │ status │ husband_age │ husband_age_requirement │ husband_and_wife │ income_requirement │
│     │ Int64 │ Int64 │ Int64 │ Int64  │ Int64?      │ Bool                    │ Bool             │ Bool               │
├─────┼───────┼───────┼───────┼────────┼─────────────┼─────────────────────────┼──────────────────┼────────────────────┤
│ 1   │ 1     │ 53    │ 5     │ 1      │ 53          │ 1                       │ 1                │ 0                  │
│ 2   │ 1     │ 52    │ 5     │ 2      │ missing     │ 1                       │ 1                │ 0                  │
│ 3   │ 1     │ 17    │ 0     │ 3      │ missing     │ 1                       │ 1                │ 0                  │
│ 4   │ 2     │ 30    │ 0     │ 1      │ 30          │ 0                       │ 1                │ 1                  │
│ 5   │ 2     │ 29    │ 20    │ 2      │ missing     │ 0                       │ 1                │ 1                  │
│ 6   │ 3     │ 22    │ 12    │ 1      │ 22          │ 0                       │ 0                │ 1                  │
│ 7   │ 4     │ 61    │ 15    │ 1      │ 61          │ 1                       │ 0                │ 1                  │
│ 8   │ 5     │ 55    │ 11    │ 1      │ 55          │ 1                       │ 1                │ 1                  │
│ 9   │ 5     │ 51    │ 0     │ 2      │ missing     │ 1                       │ 1                │ 1                  │
│ 10  │ 6     │ 67    │ 12    │ 1      │ 67          │ 0                       │ 1                │ 1                  │
│ 11  │ 6     │ 62    │ 12    │ 2      │ missing     │ 0                       │ 1                │ 1                  │
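As one example of the "different ways", the two `any` calls can be folded into a single one, because `any(a) | any(b)` gives the same result as `any(a .| b)`. The sketch below checks this on the example data in plain Julia. The function names are made up for illustration, and whether the one-`any` form can be written directly in a Douglass command is an assumption I have not verified.

```julia
id     = [1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6]
inc    = [5, 5, 0, 0, 20, 12, 15, 11, 0, 12, 12]
status = [1, 2, 3, 1, 2, 1, 1, 1, 2, 1, 2]

# Variant 1: one `any` per spouse, combined with `|`, as in the command above.
two_anys(s, y) = any((s .== 1) .& (y .> 10)) | any((s .== 2) .& (y .> 10))

# Variant 2: a single `any` over "is a spouse and earns more than 10".
one_any(s, y) = any(in.(s, Ref((1, 2))) .& (y .> 10))

# Evaluate a variant per household and repeat the result on each of its rows.
per_row(f) = [f(status[id .== g], inc[id .== g]) for g in id]

income_requirement = per_row(two_anys)
```

On this data both variants agree row by row; only household 1 fails, since neither spouse earns more than 10.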
Now keep only those observations that satisfy all three requirements.
Douglass> keep if :husband_age_requirement & :husband_and_wife & :income_requirement
2×8 DataFrame
│ Row │ id    │ age   │ inc   │ status │ husband_age │ husband_age_requirement │ husband_and_wife │ income_requirement │
│     │ Int64 │ Int64 │ Int64 │ Int64  │ Int64?      │ Bool                    │ Bool             │ Bool               │
├─────┼───────┼───────┼───────┼────────┼─────────────┼─────────────────────────┼──────────────────┼────────────────────┤
│ 1   │ 5     │ 55    │ 11    │ 1      │ 55          │ 1                       │ 1                │ 1                  │
│ 2   │ 5     │ 51    │ 0     │ 2      │ missing     │ 1                       │ 1                │ 1                  │
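The `keep if` step amounts to filtering rows with a Boolean mask. A minimal plain-Julia sketch, with the three dummy columns copied from the table above as bare vectors (again only an illustration of the semantics):

```julia
id                      = [1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6]
husband_age_requirement = Bool[1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0]
husband_and_wife        = Bool[1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
income_requirement      = Bool[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# A row survives only if all three requirements hold.
keep = husband_age_requirement .& husband_and_wife .& income_requirement

kept_rows = findall(keep)  # original row numbers of the surviving observations
kept_ids  = id[keep]       # household ids of the surviving observations
```

Only the two rows of household 5 satisfy all three requirements, matching the 2×8 result.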
I should say that while I've tried to implement these functions using DataFrames' `combine` and `transform`, performance has not been my main concern so far, and I haven't done much benchmarking yet. Still, I would not expect this to take very long on 2.5 million observations.