Complex filtering of DataFrame

I have this data frame that I want to filter:

using DataFrames
using StatsBase

df = DataFrame(id = repeat(sample(1:1000, 100; replace=false), inner=2), s = repeat(1:2, inner=1, outer=100))
insertcols!(df, :t => rand(2:100, 200))
insertcols!(df, :c => df.t .+ (rand(200) .> 0.25))

that results in a DataFrame with 4 columns: id, s, t, and c. For some ids, there are situations where c > t for both values of s. I want to select only those cases, but I can’t figure out how to do it.

I have tried this:

gdf = groupby(df, :id)
df2 = filter([:t, :c] => (t, c) -> c > t, gdf)

but that also selects cases for when c > t for only one value of s. Any suggestions?

EDIT: Found a way:

insertcols!(df, :ind => df.c .> df.t)
gdf = groupby(df, :id)
tot = transform(gdf, :ind => sum => :total)
df2 = filter(:total => x -> (x .== 2), tot)

Is there a more elegant way?

I assume you want something simple. Then you can do this:

DataFrame(filter(sdf -> all(sdf.c .> sdf.t), groupby(df, :id)))

(this is not the most efficient way to do it, but I believe the code is easiest to follow)

1 Like

a different way of proceeding that perhaps better reflects the attempt you made.
You select the rows for which c> t and then select only the groups that have a pair of rows.

df2 = filter([:t, :c] => (t, c) -> c > t, df)

filter(x->x.nrow==2,combine(groupby(df2, :id),:,nrow))
1 Like

Can you give some pointers about a more efficient way to do it?

try this, waiting for better one

even=reshape(Tables.rowtable(df),:,2)[2:2:end]
odd=reshape(Tables.rowtable(df)),:,2)[1:2:end]
DataFrame(reduce(vcat, [[f,s] for (f,s) in zip(odd,even) if f.c > f.t && s.c > s.t]))

1 Like

This seems like unnecessarily obtuse code, fwiw.

A “more efficient way to do it” would be to se the Pair syntax for filter rather than an anonymous function which applies to each row.

If you want to be efficient you take advantage of the fact that the original data frame is already pre-grouped by :id, then you can do:

function testfun(t, c)
    @assert length(t) == length(c)
    @assert iseven(length(t))
    @assert firstindex(t) == firstindex(c) == 1
    cond = fill(true, length(t))
    @inbounds for i in 1:2:length(t)
        if t[i] >= c[i] || t[i+1] >= c[i+1]
            cond[i] = false
            cond[i+1] = false
        end
    end
    return cond
end

and then do

df[testfun(df.t, df.c), :]
2 Likes

Could you elaborate on this observation and also explain the filter solution you are referring to?
I meant efficient in the sense of execution time. The solution that I have proposed is much faster than the other views, if I have not misused the benchmark tool.

I assume @pdeffebach meant:

DataFrame(filter([:c, :t] => (c, t) -> all(((x, y),) -> x > y, zip(c, t)), groupby(df, :id)))

it is faster than my original code, but slower than what I have proposed above.

2 Likes