Complex filtering of DataFrame

amrods · March 29, 2022, 3:19am

I have this data frame that I want to filter:

using DataFrames
using StatsBase

df = DataFrame(id = repeat(sample(1:1000, 100; replace=false), inner=2), s = repeat(1:2, inner=1, outer=100))
insertcols!(df, :t => rand(2:100, 200))
insertcols!(df, :c => df.t .+ (rand(200) .> 0.25))

that results in a DataFrame with 4 columns: id, s, t, and c. For some ids, there are situations where c > t for both values of s. I want to select only those cases, but I can’t figure out how to do it.

I have tried this:

gdf = groupby(df, :id)
df2 = filter([:t, :c] => (t, c) -> c > t, gdf)

but that also selects cases for when c > t for only one value of s. Any suggestions?

EDIT: Found a way:

insertcols!(df, :ind => df.c .> df.t)
gdf = groupby(df, :id)
tot = transform(gdf, :ind => sum => :total)
df2 = filter(:total => x -> (x .== 2), tot)

Is there a more elegant way?

bkamins · March 29, 2022, 7:21am

I assume you want something simple. Then you can do this:

DataFrame(filter(sdf -> all(sdf.c .> sdf.t), groupby(df, :id)))

(this is not the most efficient way to do it, but I believe the code is easiest to follow)

rocco_sprmnt21 · March 29, 2022, 11:01am

a different way of proceeding that perhaps better reflects the attempt you made.
You select the rows for which c> t and then select only the groups that have a pair of rows.

df2 = filter([:t, :c] => (t, c) -> c > t, df)

filter(x->x.nrow==2,combine(groupby(df2, :id),:,nrow))

amrods · March 29, 2022, 12:01pm

Can you give some pointers about a more efficient way to do it?

rocco_sprmnt21 · March 29, 2022, 1:26pm

try this, waiting for better one

even=reshape(Tables.rowtable(df),:,2)[2:2:end]
odd=reshape(Tables.rowtable(df)),:,2)[1:2:end]
DataFrame(reduce(vcat, [[f,s] for (f,s) in zip(odd,even) if f.c > f.t && s.c > s.t]))

pdeffebach · March 29, 2022, 1:44pm

This seems like unnecessarily obtuse code, fwiw.

A “more efficient way to do it” would be to se the Pair syntax for filter rather than an anonymous function which applies to each row.

bkamins · March 29, 2022, 3:37pm

If you want to be efficient you take advantage of the fact that the original data frame is already pre-grouped by :id, then you can do:

function testfun(t, c)
    @assert length(t) == length(c)
    @assert iseven(length(t))
    @assert firstindex(t) == firstindex(c) == 1
    cond = fill(true, length(t))
    @inbounds for i in 1:2:length(t)
        if t[i] >= c[i] || t[i+1] >= c[i+1]
            cond[i] = false
            cond[i+1] = false
        end
    end
    return cond
end

and then do

df[testfun(df.t, df.c), :]

rocco_sprmnt21 · March 29, 2022, 3:38pm

Could you elaborate on this observation and also explain the filter solution you are referring to?
I meant efficient in the sense of execution time. The solution that I have proposed is much faster than the other views, if I have not misused the benchmark tool.

bkamins · March 29, 2022, 3:45pm

I assume @pdeffebach meant:

DataFrame(filter([:c, :t] => (c, t) -> all(((x, y),) -> x > y, zip(c, t)), groupby(df, :id)))

it is faster than my original code, but slower than what I have proposed above.

rocco_sprmnt21 · July 22, 2022, 1:48pm

Given that, by construction, in the present dataframe in 3 cases out of 4 C is greater than t,
common sense suggests filling the vector with all true and changing only about 1/4 of the elements.
However, I noticed that a small improvement can be obtained by splitting the comparison c >= t into two parts and putting the equality check first: c == t || c> t.
In fact, for the case in question and with the following dataframe, the calculation time of the “good” indices is more than halved.

df = DataFrame(id = repeat(sample(1:10^5, 10000; replace=false), inner=2), s = repeat(1:2, inner=1, outer=10000))
insertcols!(df, :t => rand(2:100, 20000))
insertcols!(df, :c => df.t .+ (rand(20000) .> 0.25))



julia> function testfun(t, c)
           @assert length(t) == length(c)
           @assert iseven(length(t))
           @assert firstindex(t) == firstindex(c) == 1
           cond = fill(true, length(t))
           @inbounds for i in 1:2:length(t)
               if t[i] >= c[i]  || t[i+1] >= c[i+1]
                   cond[i] = false
                   cond[i+1] = false
               end
           end
           return cond
       end
testfun (generic function with 1 method)

julia> @btime testfun(df.t, df.c);
  15.500 μs (2 allocations: 19.61 KiB)


julia> function testfun(t, c)
           @assert length(t) == length(c)
           @assert iseven(length(t))
           @assert firstindex(t) == firstindex(c) == 1
           cond = fill(true, length(t))
           @inbounds for i in 1:2:length(t)
               if t[i] == c[i]|| t[i+1] == c[i+1] ||t[i] > c[i]  || t[i+1] > c[i+1]  
                   cond[i] = false
                   cond[i+1] = false
               end
           end
           return cond
       end
testfun (generic function with 1 method)


julia> @btime testfun(df.t, df.c);
  7.300 μs (2 allocations: 19.61 KiB)

I wonder, at this point, how the expression with >= is internally evaluated

Topic		Replies	Views
How to filter data conditional on two columns General Usage data , dataframes	17	1782	January 11, 2022
DataFramesMeta custom filter: by groups of A, apply filter on B Data	4	1418	May 27, 2019
DataFrame Filtering New to Julia question , dataframes	10	1121	September 20, 2023
Dataframe Filter New to Julia question , dataframes	6	6375	March 26, 2022
Filter rows after a perticular value in column New to Julia dataframes	5	1124	April 26, 2022

Complex filtering of DataFrame

Related topics