I need to select groups of observations from a large dataframe (about 2.9 mio rows) using a number of conditions which apply to different observations in each group (so I cannot select on individual rows only).
Using a small dataframe as a starting point, I wrote an algorithm which applies the conditions and generates the result dataframe with my desired groups within a loop. See mwe below.
If I apply this loop to the large dataframe, performance becomes a (serious) problem.
I havenβt used the split/apply/combine approach before so I am learning about it right now. I worked through the documentation but I havenβt been able to write code for my problem yet.
For example, I am struggling to understand how to select different rows of groupeddataframes. I figured out how to get the age of the status1 row select(combine(first, gdf), :status => :obs1_status)
but not for the status2 row.
Any hints/guidance on how to implement selection conditions on groupeddataframes using combine/select/transform commands? (I.e. generate df_result in a faster way?)
Thanks a lot!
using DataFrames
# generate sample dataframe
df = DataFrame(id = [1,1,1,2,2,3,4], age = [53,52,17,31,29,22,71], status = [1,2,3,1,2,1,1])
# initialize result dataframe
df_result = copy(df[1:2,:]; copycols=true);
for k = 1:maximum(df.id)
df_temp = df[df.id .== k, :]
# Keep only groups with members status == 1 and status == 2
N = count(i->(i == 1),df_temp.status) + count(i->(i == 2),df_temp.status)
if N != 2
continue
end
# Keep only groups with working age status1
stat1_age = df_temp[df_temp.status .== 1, :age]
if (stat1_age[1] > 61 || stat1_age[1] < 25)
continue
end
append!(df_result,df_temp)
end
delete!(df_result,1:2);