I have this data frame that I want to filter:
df = DataFrame(id = repeat(sample(1:1000, 100; replace=false), inner=2), s = repeat(1:2, inner=1, outer=100))
insertcols!(df, :t => rand(2:100, 200))
insertcols!(df, :c => df.t .+ (rand(200) .> 0.25))
that results in a
DataFrame with 4 columns:
c. For some
ids, there are situations where
c > t for both values of
s. I want to select only those cases, but I can’t figure out how to do it.
I have tried this:
gdf = groupby(df, :id)
df2 = filter([:t, :c] => (t, c) -> c > t, gdf)
but that also selects cases for when
c > t for only one value of
s. Any suggestions?
EDIT: Found a way:
insertcols!(df, :ind => df.c .> df.t)
gdf = groupby(df, :id)
tot = transform(gdf, :ind => sum => :total)
df2 = filter(:total => x -> (x .== 2), tot)
Is there a more elegant way?
I assume you want something simple. Then you can do this:
DataFrame(filter(sdf -> all(sdf.c .> sdf.t), groupby(df, :id)))
(this is not the most efficient way to do it, but I believe the code is easiest to follow)
a different way of proceeding that perhaps better reflects the attempt you made.
You select the rows for which c> t and then select only the groups that have a pair of rows.
df2 = filter([:t, :c] => (t, c) -> c > t, df)
Can you give some pointers about a more efficient way to do it?
try this, waiting for better one
DataFrame(reduce(vcat, [[f,s] for (f,s) in zip(odd,even) if f.c > f.t && s.c > s.t]))
This seems like unnecessarily obtuse code, fwiw.
A “more efficient way to do it” would be to se the Pair syntax for
filter rather than an anonymous function which applies to each row.
If you want to be efficient you take advantage of the fact that the original data frame is already pre-grouped by
:id, then you can do:
function testfun(t, c)
@assert length(t) == length(c)
@assert firstindex(t) == firstindex(c) == 1
cond = fill(true, length(t))
@inbounds for i in 1:2:length(t)
if t[i] >= c[i] || t[i+1] >= c[i+1]
cond[i] = false
cond[i+1] = false
and then do
df[testfun(df.t, df.c), :]
Could you elaborate on this observation and also explain the filter solution you are referring to?
I meant efficient in the sense of execution time. The solution that I have proposed is much faster than the other views, if I have not misused the benchmark tool.
I assume @pdeffebach meant:
DataFrame(filter([:c, :t] => (c, t) -> all(((x, y),) -> x > y, zip(c, t)), groupby(df, :id)))
it is faster than my original code, but slower than what I have proposed above.