Complex filtering of DataFrame

I have this data frame that I want to filter:

using DataFrames
using StatsBase

df = DataFrame(id = repeat(sample(1:1000, 100; replace=false), inner=2), s = repeat(1:2, inner=1, outer=100))
insertcols!(df, :t => rand(2:100, 200))
insertcols!(df, :c => df.t .+ (rand(200) .> 0.25))

that results in a DataFrame with 4 columns: id, s, t, and c. For some ids, there are situations where c > t for both values of s. I want to select only those cases, but I can’t figure out how to do it.

I have tried this:

gdf = groupby(df, :id)
df2 = filter([:t, :c] => (t, c) -> c > t, gdf)

but that also selects cases for when c > t for only one value of s. Any suggestions?

EDIT: Found a way:

insertcols!(df, :ind => df.c .> df.t)
gdf = groupby(df, :id)
tot = transform(gdf, :ind => sum => :total)
df2 = filter(:total => x -> (x .== 2), tot)

Is there a more elegant way?

I assume you want something simple. Then you can do this:

DataFrame(filter(sdf -> all(sdf.c .> sdf.t), groupby(df, :id)))

(this is not the most efficient way to do it, but I believe the code is easiest to follow)

1 Like

a different way of proceeding that perhaps better reflects the attempt you made.
You select the rows for which c> t and then select only the groups that have a pair of rows.

df2 = filter([:t, :c] => (t, c) -> c > t, df)

filter(x->x.nrow==2,combine(groupby(df2, :id),:,nrow))
1 Like

Can you give some pointers about a more efficient way to do it?

try this, waiting for better one

even=reshape(Tables.rowtable(df),:,2)[2:2:end]
odd=reshape(Tables.rowtable(df)),:,2)[1:2:end]
DataFrame(reduce(vcat, [[f,s] for (f,s) in zip(odd,even) if f.c > f.t && s.c > s.t]))

1 Like

This seems like unnecessarily obtuse code, fwiw.

A “more efficient way to do it” would be to se the Pair syntax for filter rather than an anonymous function which applies to each row.

If you want to be efficient you take advantage of the fact that the original data frame is already pre-grouped by :id, then you can do:

function testfun(t, c)
    @assert length(t) == length(c)
    @assert iseven(length(t))
    @assert firstindex(t) == firstindex(c) == 1
    cond = fill(true, length(t))
    @inbounds for i in 1:2:length(t)
        if t[i] >= c[i] || t[i+1] >= c[i+1]
            cond[i] = false
            cond[i+1] = false
        end
    end
    return cond
end

and then do

df[testfun(df.t, df.c), :]
2 Likes

Could you elaborate on this observation and also explain the filter solution you are referring to?
I meant efficient in the sense of execution time. The solution that I have proposed is much faster than the other views, if I have not misused the benchmark tool.

I assume @pdeffebach meant:

DataFrame(filter([:c, :t] => (c, t) -> all(((x, y),) -> x > y, zip(c, t)), groupby(df, :id)))

it is faster than my original code, but slower than what I have proposed above.

2 Likes

Given that, by construction, in the present dataframe in 3 cases out of 4 C is greater than t,
common sense suggests filling the vector with all true and changing only about 1/4 of the elements.
However, I noticed that a small improvement can be obtained by splitting the comparison c >= t into two parts and putting the equality check first: c == t || c> t.
In fact, for the case in question and with the following dataframe, the calculation time of the “good” indices is more than halved.

df = DataFrame(id = repeat(sample(1:10^5, 10000; replace=false), inner=2), s = repeat(1:2, inner=1, outer=10000))
insertcols!(df, :t => rand(2:100, 20000))
insertcols!(df, :c => df.t .+ (rand(20000) .> 0.25))



julia> function testfun(t, c)
           @assert length(t) == length(c)
           @assert iseven(length(t))
           @assert firstindex(t) == firstindex(c) == 1
           cond = fill(true, length(t))
           @inbounds for i in 1:2:length(t)
               if t[i] >= c[i]  || t[i+1] >= c[i+1]
                   cond[i] = false
                   cond[i+1] = false
               end
           end
           return cond
       end
testfun (generic function with 1 method)

julia> @btime testfun(df.t, df.c);
  15.500 μs (2 allocations: 19.61 KiB)


julia> function testfun(t, c)
           @assert length(t) == length(c)
           @assert iseven(length(t))
           @assert firstindex(t) == firstindex(c) == 1
           cond = fill(true, length(t))
           @inbounds for i in 1:2:length(t)
               if t[i] == c[i]|| t[i+1] == c[i+1] ||t[i] > c[i]  || t[i+1] > c[i+1]  
                   cond[i] = false
                   cond[i+1] = false
               end
           end
           return cond
       end
testfun (generic function with 1 method)


julia> @btime testfun(df.t, df.c);
  7.300 μs (2 allocations: 19.61 KiB)

I wonder, at this point, how the expression with >= is internally evaluated