How do I convert the following code fragment to use subset rather than filter, and is there a benefit in doing so for large data frames?
using DataFrames
df = DataFrame(a = [1,2,3,4,6,6], b=[1,missing,3,missing,5,missing], c=[missing,2,missing,4,missing,6])
filter([:a, :b] => (x, y) -> ismissing(y) || x > y, df)
oheil
May 15, 2021, 4:17pm
2
I don't know if you'll accept this solution:
julia> function isgreater(x,y)
           if ismissing(x) || ismissing(y)
               return false
           end
           return x>y
       end
julia> subset(df, [:a,:b] => (x,y) -> ismissing.(y) .| isgreater.(x,y) )
4×3 DataFrame
 Row │ a      b        c
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2  missing        2
   2 │     4  missing        4
   3 │     6        5  missing
   4 │     6  missing        6
I don't think subset offers a benefit here; a benchmark shows it's slower.
oheil
May 15, 2021, 4:23pm
3
julia> @btime filter(:b => x -> ismissing(x), $df)
  1.330 μs (15 allocations: 1.42 KiB)
julia> @btime subset($df, :b => x -> ismissing.(x) )
  20.700 μs (146 allocations: 8.77 KiB)
In this case filter is efficient, while subset does more pre- and post-processing work, so it can be expected to be slower. The slowdown should be roughly a constant overhead, though: as your data grows, the performance of the two should become comparable.
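If you want to check the constant-overhead claim on your own machine, something like the following should do. It is only a sketch: it assumes DataFrames.jl and BenchmarkTools.jl are installed, and the names n and big as well as the row count are arbitrary choices made here, not anything from the posts above. No timings are quoted because they will vary by machine and package version.

```julia
using DataFrames, BenchmarkTools

# Hypothetical test table: one million rows, every even row missing in :b
n = 1_000_000
big = DataFrame(b = [isodd(i) ? i : missing for i in 1:n])

# Same selection both ways: keep the rows where :b is missing
@btime filter(:b => ismissing, $big);
@btime subset($big, :b => ByRow(ismissing));
```

Comparing these two timings with the six-row timings above shows whether the gap stays roughly fixed or grows with the data.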
qsong
May 16, 2021, 4:53am
5
Why not just use ByRow?
subset(df, [:a, :b] => ByRow((x, y) -> ismissing(y) || x > y))
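A quick way to confirm that the ByRow form selects the same rows as the original filter call is sketched below, assuming DataFrames.jl is installed. The helper name pred is just a label chosen here. Note that filter passes one row's values to the predicate, while subset passes whole columns unless the function is wrapped in ByRow.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 4, 6, 6],
               b = [1, missing, 3, missing, 5, missing],
               c = [missing, 2, missing, 4, missing, 6])

# Row-wise predicate from the question; || short-circuits, so x > y is only
# evaluated when y is not missing and the result is always a Bool.
pred(x, y) = ismissing(y) || x > y

by_filter = filter([:a, :b] => pred, df)         # predicate receives scalars
by_subset = subset(df, [:a, :b] => ByRow(pred))  # ByRow applies pred per row

# isequal treats missing as equal to missing, whereas == would return missing
@assert isequal(by_filter, by_subset)
```

Because pred never returns missing, there is no need for the skipmissing keyword of subset here.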