using DataFrames
using StatsBase
df = DataFrame(id = repeat(sample(1:1000, 100; replace=false), inner=2), s = repeat(1:2, inner=1, outer=100))
insertcols!(df, :t => rand(2:100, 200))
insertcols!(df, :c => df.t .+ (rand(200) .> 0.25))
that results in a DataFrame with 4 columns: id, s, t, and c. For some ids, there are situations where c > t for both values of s. I want to select only those cases, but I can’t figure out how to do it.
I have tried this:
gdf = groupby(df, :id)
df2 = filter([:t, :c] => (t, c) -> c > t, gdf)
but that also selects cases for when c > t for only one value of s. Any suggestions?
EDIT: Found a way:
insertcols!(df, :ind => df.c .> df.t)
gdf = groupby(df, :id)
tot = transform(gdf, :ind => sum => :total)
df2 = filter(:total => x -> (x .== 2), tot)
a different way of proceeding that perhaps better reflects the attempt you made.
You select the rows for which c> t and then select only the groups that have a pair of rows.
df2 = filter([:t, :c] => (t, c) -> c > t, df)
filter(x->x.nrow==2,combine(groupby(df2, :id),:,nrow))
If you want to be efficient you take advantage of the fact that the original data frame is already pre-grouped by :id, then you can do:
function testfun(t, c)
@assert length(t) == length(c)
@assert iseven(length(t))
@assert firstindex(t) == firstindex(c) == 1
cond = fill(true, length(t))
@inbounds for i in 1:2:length(t)
if t[i] >= c[i] || t[i+1] >= c[i+1]
cond[i] = false
cond[i+1] = false
end
end
return cond
end
Could you elaborate on this observation and also explain the filter solution you are referring to?
I meant efficient in the sense of execution time. The solution that I have proposed is much faster than the other views, if I have not misused the benchmark tool.
Given that, by construction, in the present dataframe in 3 cases out of 4 C is greater than t,
common sense suggests filling the vector with all true and changing only about 1/4 of the elements.
However, I noticed that a small improvement can be obtained by splitting the comparison c >= t into two parts and putting the equality check first: c == t || c> t.
In fact, for the case in question and with the following dataframe, the calculation time of the “good” indices is more than halved.
df = DataFrame(id = repeat(sample(1:10^5, 10000; replace=false), inner=2), s = repeat(1:2, inner=1, outer=10000))
insertcols!(df, :t => rand(2:100, 20000))
insertcols!(df, :c => df.t .+ (rand(20000) .> 0.25))
julia> function testfun(t, c)
@assert length(t) == length(c)
@assert iseven(length(t))
@assert firstindex(t) == firstindex(c) == 1
cond = fill(true, length(t))
@inbounds for i in 1:2:length(t)
if t[i] >= c[i] || t[i+1] >= c[i+1]
cond[i] = false
cond[i+1] = false
end
end
return cond
end
testfun (generic function with 1 method)
julia> @btime testfun(df.t, df.c);
15.500 μs (2 allocations: 19.61 KiB)
julia> function testfun(t, c)
@assert length(t) == length(c)
@assert iseven(length(t))
@assert firstindex(t) == firstindex(c) == 1
cond = fill(true, length(t))
@inbounds for i in 1:2:length(t)
if t[i] == c[i]|| t[i+1] == c[i+1] ||t[i] > c[i] || t[i+1] > c[i+1]
cond[i] = false
cond[i+1] = false
end
end
return cond
end
testfun (generic function with 1 method)
julia> @btime testfun(df.t, df.c);
7.300 μs (2 allocations: 19.61 KiB)
I wonder, at this point, how the expression with >= is internally evaluated