Converting filter using "||" condition to subset

How do I convert the following code fragment to use `subset` rather than `filter`, and is there a benefit in doing so for large data frames?

```julia
using DataFrames

df = DataFrame(a = [1,2,3,4,6,6], b=[1,missing,3,missing,5,missing], c=[missing,2,missing,4,missing,6])
filter([:a, :b] => (x, y) -> ismissing(y) || x > y, df)
```

I donβt know if you accept this solution:

```julia
julia> function isgreater(x, y)
           if ismissing(x) || ismissing(y)
               return false
           end
           return x > y
       end

julia> subset(df, [:a, :b] => (x, y) -> ismissing.(y) .| isgreater.(x, y))
4×3 DataFrame
 Row │ a      b        c
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2  missing        2
   2 │     4  missing        4
   3 │     6        5  missing
   4 │     6  missing        6
```
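As a side note, the helper may not be strictly necessary for this particular data. The sketch below assumes the same `df` as in the question and relies on Julia's three-valued logic, where `true | missing` evaluates to `true`; it is one possible shortcut, not a general replacement for the helper.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 4, 6, 6],
               b = [1, missing, 3, missing, 5, missing],
               c = [missing, 2, missing, 4, missing, 6])

# `x .> y` is `missing` wherever `y` is missing, but `true | missing` evaluates
# to `true`, so the combined mask holds only `true`/`false` for this data.
result = subset(df, [:a, :b] => (x, y) -> ismissing.(y) .| (x .> y))

@assert result.a == [2, 4, 6, 6]
```

This only works because column `a` has no missing values. If `a` could be missing while `b` is not, the mask would contain `false | missing`, which is `missing`, and `subset` rejects that by default.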

I donβt think `subset` is a benefit, a benchmark shows itβs slower.

```julia
julia> @btime filter(:b => x -> ismissing(x), $df)
1.330 μs (15 allocations: 1.42 KiB)

julia> @btime subset($df, :b => x -> ismissing.(x) )
20.700 μs (146 allocations: 8.77 KiB)
```

In this case `filter` is efficient, while `subset` does more pre- and post-processing work, so it can be expected to be slower. The slowdown should be roughly constant, though: as your data grows, the performance of the two should become comparable.
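One way to check this on your own machine is to repeat the comparison on a much larger frame. The sketch below is only an illustration: the frame `big`, its size `n`, and its contents are made up, and the timing calls are left commented out because they need BenchmarkTools and their results will vary.

```julia
using DataFrames

# Hypothetical larger frame; the size and contents are invented for illustration.
n = 100_000
big = DataFrame(a = rand(1:10, n),
                b = [isodd(i) ? missing : rand(1:10) for i in 1:n])

by_filter = filter(:b => ismissing, big)
by_subset = subset(big, :b => ByRow(ismissing))

# Both calls keep the same rows; `isequal` is used because `b` contains `missing`.
@assert isequal(by_filter, by_subset)

# With BenchmarkTools loaded, the two calls could be timed like this:
# @btime filter(:b => ismissing, $big);
# @btime subset($big, :b => ByRow(ismissing));
```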

Why not just use `ByRow`?

```julia
subset(df, [:a, :b] => ByRow((x, y) -> ismissing(y) || x > y))
```
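To see that the `ByRow` form selects the same rows as the original `filter` call, here is a small check using the `df` from the question. The helper name `keep` is made up for this sketch; it just gives the row-wise predicate a name so both calls can share it.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 4, 6, 6],
               b = [1, missing, 3, missing, 5, missing],
               c = [missing, 2, missing, 4, missing, 6])

# Row-wise predicate; `||` short-circuits, so `x > y` never sees a missing `y`.
keep(x, y) = ismissing(y) || x > y

via_filter = filter([:a, :b] => keep, df)          # filter applies `keep` per row
via_subset = subset(df, [:a, :b] => ByRow(keep))   # ByRow makes subset do the same

# `isequal` treats `missing` values as equal, unlike `==`.
@assert isequal(via_filter, via_subset)
```

Because `ByRow` wraps a scalar function, the short-circuiting `||` from the `filter` version carries over unchanged, and no broadcasting dots are needed.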