Converting filter using "||" condition to subset

How do I convert the following code fragment to use subset rather than filter, and is there a benefit in doing so for large data frames?

using DataFrames

df = DataFrame(a = [1,2,3,4,6,6], b=[1,missing,3,missing,5,missing], c=[missing,2,missing,4,missing,6])
filter([:a, :b] => (x, y) -> ismissing(y) || x > y, df)

I don’t know whether this solution is acceptable to you:

julia> function isgreater(x,y)
           if ismissing(x) || ismissing(y)
               return false
           end
           return x>y
       end

julia> subset(df, [:a,:b] => (x,y) -> ismissing.(y) .| isgreater.(x,y) )
4×3 DataFrame
 Row │ a      b        c
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2  missing        2
   2 │     4  missing        4
   3 │     6        5  missing
   4 │     6  missing        6

I don’t think subset is a benefit here. A benchmark of a simpler predicate on this six-row data frame shows it is slower:

julia> using BenchmarkTools

julia> @btime filter(:b => x -> ismissing(x), $df)
  1.330 μs (15 allocations: 1.42 KiB)

julia> @btime subset($df, :b => x -> ismissing.(x) )
  20.700 μs (146 allocations: 8.77 KiB)

In this case filter is efficient, while subset does more pre- and post-processing work, so it can be expected to be slower. The slowdown should be a roughly constant overhead, though: as your data grows, the performance of the two should become comparable.
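If you want to check this on your own machine, a rough sketch along these lines should do. The table `big`, its size `n`, and the helper `pred` are made up for illustration and are not from the posts above. No timings are shown because they depend on the machine and package versions.

```julia
using DataFrames, BenchmarkTools

# Hypothetical larger table: n rows, every odd row of :b is missing
n = 1_000_000
big = DataFrame(a = rand(1:100, n),
                b = [isodd(i) ? missing : rand(1:100) for i in 1:n])

# Same row-wise condition as in the original question
pred(x, y) = ismissing(y) || x > y

# filter applies pred row by row; subset needs ByRow to do the same
@btime filter([:a, :b] => pred, $big);
@btime subset($big, [:a, :b] => ByRow(pred));
```

Running it again with a much smaller `n` should show whether the gap between the two stays roughly fixed rather than growing with the data.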


Why not just use ByRow?

subset(df, [:a, :b] => ByRow((x, y) -> ismissing(y) || x > y))
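For what it is worth, here is a small sketch that checks the ByRow version against the original filter call on the example data. The names `keep`, `r_filter`, `r_subset`, and `r_coalesce` are only for illustration. The last call is an alternative I am adding, not something from the posts above: it relies on `x > missing` evaluating to `missing`, which `coalesce` then turns into `true`.

```julia
using DataFrames

df = DataFrame(a = [1,2,3,4,6,6],
               b = [1,missing,3,missing,5,missing],
               c = [missing,2,missing,4,missing,6])

# Row-wise condition from the question
keep(x, y) = ismissing(y) || x > y

r_filter = filter([:a, :b] => keep, df)
r_subset = subset(df, [:a, :b] => ByRow(keep))

# isequal treats missing as equal to missing, unlike ==
isequal(r_filter, r_subset)   # true

# Alternative spelling: a missing comparison result counts as "keep the row"
r_coalesce = subset(df, [:a, :b] => ByRow((x, y) -> coalesce(x > y, true)))
isequal(r_filter, r_coalesce) # true
```

The difference from the earlier subset answer is that ByRow applies the function to one row at a time, so the broadcasting dots and the isgreater helper are not needed.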