using DataFrames
using StatsBase
df = DataFrame(id = repeat(sample(1:1000, 100; replace=false), inner=2), s = repeat(1:2, inner=1, outer=100))
insertcols!(df, :t => rand(2:100, 200))
insertcols!(df, :c => df.t .+ (rand(200) .> 0.25))
that results in a DataFrame with 4 columns: id, s, t, and c. For some ids, there are situations where c > t for both values of s. I want to select only those cases, but I can’t figure out how to do it.
I have tried this:
gdf = groupby(df, :id)
df2 = filter([:t, :c] => (t, c) -> c > t, gdf)
but that also selects cases for when c > t for only one value of s. Any suggestions?
EDIT: Found a way:
insertcols!(df, :ind => df.c .> df.t)
gdf = groupby(df, :id)
tot = transform(gdf, :ind => sum => :total)
df2 = filter(:total => x -> (x .== 2), tot)
a different way of proceeding that perhaps better reflects the attempt you made.
You select the rows for which c> t and then select only the groups that have a pair of rows.
df2 = filter([:t, :c] => (t, c) -> c > t, df)
filter(x->x.nrow==2,combine(groupby(df2, :id),:,nrow))
If you want to be efficient you take advantage of the fact that the original data frame is already pre-grouped by :id, then you can do:
function testfun(t, c)
@assert length(t) == length(c)
@assert iseven(length(t))
@assert firstindex(t) == firstindex(c) == 1
cond = fill(true, length(t))
@inbounds for i in 1:2:length(t)
if t[i] >= c[i] || t[i+1] >= c[i+1]
cond[i] = false
cond[i+1] = false
end
end
return cond
end
Could you elaborate on this observation and also explain the filter solution you are referring to?
I meant efficient in the sense of execution time. The solution that I have proposed is much faster than the other views, if I have not misused the benchmark tool.