Converting filter using "||" condition to subset

kobusherbst · May 15, 2021, 3:09pm

How do I convert the following code fragment to use subset rather than filter and is there a benefit in doing so for large dataframes?

using DataFrames

df = DataFrame(a = [1,2,3,4,6,6], b=[1,missing,3,missing,5,missing], c=[missing,2,missing,4,missing,6])
filter([:a, :b] => (x, y) -> ismissing(y) || x > y, df)

oheil · May 15, 2021, 4:17pm

I don’t know whether this solution is acceptable to you:

julia> function isgreater(x,y)
           if ismissing(x) || ismissing(y)
               return false
           end
           return x>y
       end

julia> subset(df, [:a,:b] => (x,y) -> ismissing.(y) .| isgreater.(x,y) )
4×3 DataFrame
 Row │ a      b        c
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2  missing        2
   2 │     4  missing        4
   3 │     6        5  missing
   4 │     6  missing        6

I don’t think subset offers a benefit here; a benchmark shows it’s slower.

oheil · May 15, 2021, 4:23pm

julia> using BenchmarkTools

julia> @btime filter(:b => x -> ismissing(x), $df)
  1.330 μs (15 allocations: 1.42 KiB)

julia> @btime subset($df, :b => x -> ismissing.(x) )
  20.700 μs (146 allocations: 8.77 KiB)

bkamins · May 15, 2021, 5:29pm

In this case filter is efficient, while subset does more pre- and post-processing work, so it can be expected to be slower. The slowdown should be roughly constant, though: as your data grows, the performance of the two should become comparable.
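
One rough way to check the constant-overhead point is to time both calls on a much larger table. The sketch below is only an illustration: the row count `n` and the name `big` are assumptions, not part of the original benchmark, and `@elapsed` from Base stands in for `@btime` so that no extra package is needed. No timings are shown because they depend on the machine and package versions.

```julia
using DataFrames

# Hypothetical larger table: every second value of :b is missing.
n = 1_000_000
big = DataFrame(b = [isodd(i) ? i : missing for i in 1:n])

# Run each call once first so compilation time is not measured.
filter(:b => ismissing, big)
subset(big, :b => ByRow(ismissing))

# Single-shot timings in seconds; noisy, so repeat or use @btime for real measurements.
t_filter = @elapsed filter(:b => ismissing, big)
t_subset = @elapsed subset(big, :b => ByRow(ismissing))

println("filter: ", t_filter, " s, subset: ", t_subset, " s")
```

If the fixed overhead dominates only for small tables, the two timings should be much closer here than in the six-row benchmark.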

qsong · May 16, 2021, 4:53am

Why not just use ByRow?

subset(df, [:a, :b] => ByRow((x, y) -> ismissing(y) || x > y))
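
As a quick sanity check, the ByRow version can be compared against the original filter call on the example data. This is a minimal sketch; the helper name `keep` is introduced here for illustration only. The short-circuiting `||` means `x > y` is never evaluated when `y` is missing, so the predicate always returns a Bool, which is what subset requires.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 4, 6, 6],
               b = [1, missing, 3, missing, 5, missing],
               c = [missing, 2, missing, 4, missing, 6])

# Row-wise predicate shared by both calls (hypothetical helper name).
keep(x, y) = ismissing(y) || x > y

by_filter = filter([:a, :b] => keep, df)         # filter applies the predicate row by row
by_subset = subset(df, [:a, :b] => ByRow(keep))  # ByRow makes subset do the same

# isequal treats missing as equal to missing, unlike ==.
println(isequal(by_filter, by_subset))
```

Without ByRow, subset passes whole columns to the function, which is why the earlier answer needed broadcasting with `.|` and a missing-safe comparison.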
