ByRow subset vs filter performance

I just compared subset (using a :col => ByRow(==(0)) operation) with filter (using a :col => ==(0) operation) on a DataFrame. In both cases I set view=true, and I see significantly fewer allocations with filter than with subset. Would you agree that in these cases filter is more performant than subset?

Thanks!

Edit: here is an MWE:

using DataFrames

df = DataFrame(A = rand(10))
f1(df) = @time subset(df, :A => ByRow(<=(0.5)), view=true)
f2(df) = @time filter(:A => <=(0.5), df, view=true)

f1(df);
  0.000091 seconds (162 allocations: 8.391 KiB)

f2(df);
  0.000011 seconds (8 allocations: 320 bytes)

A more typical benchmark using @btime seems to support this result too (with logical indexing leading to slightly better performance than filter despite having more allocations):

julia> begin
         df = DataFrame(A = rand(10000))
         f1(df) = subset(df, :A => ByRow(<=(0.5)), view = true)
         f2(df) = filter(:A => <=(0.5), df, view = true)
         f3(df) = subset(df, :A => a -> a .<= 0.5, view = true)
         f4(df) = @view df[df.A .<= 0.5, :] 
         @btime f1(df)
         @btime f2(df)
         @btime f3(df)
         @btime f4(df)
       end;
  602.734 ms (464204 allocations: 8.92 MiB)
  379.465 ms (298631 allocations: 4.98 MiB)
  924.066 ms (640950 allocations: 11.75 MiB)
  372.004 ms (328600 allocations: 6.05 MiB)

This is expected in this case.

Could you explain why filter should be faster than subset?

In short: filter accepts only one condition and works row-wise, so its internal logic is much simpler. subset allows passing multiple conditions and works on whole columns. (If you want more details, it is best to check the source code to see the differences in implementation.)
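
To make the difference in calling conventions concrete, here is a minimal sketch. The toy data and the names r1, r2, r3 are only for illustration and are not taken from the benchmark above. It shows the same row selection written both ways, plus the multi-condition form that only subset accepts:

```julia
using DataFrames

# Small illustrative table (assumed data, not the benchmark data)
df = DataFrame(A = [0.1, 0.9, 0.4, 0.7], B = [1, 2, 3, 4])

# filter: a single predicate, called once per row with scalar values
r1 = filter(:A => <=(0.5), df)

# subset: each condition must return a Bool vector for the whole column;
# ByRow wraps the scalar predicate so that it is applied element by element
r2 = subset(df, :A => ByRow(<=(0.5)))

# subset also accepts several conditions; a row is kept only if all hold
r3 = subset(df, :A => ByRow(<=(0.5)), :B => ByRow(>(1)))
```

Here r1 and r2 contain the same rows, and r3 keeps only the rows that satisfy both conditions. The extra bookkeeping subset needs to support the r3 form is one plausible source of the additional allocations seen above.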


Thanks for pointing this out @bkamins. In general, could we say that if we are doing row-wise filtering, even on multiple conditions (joined with && operators), we should expect filter to outperform subset?

Yes, but you need to hardcode && in the predicate. Of course, this assumes you use the filter(cols => predicate, df) style.
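
As a rough sketch of what hardcoding && looks like (the helper name both and the toy data are hypothetical, chosen only for this example), a predicate over two columns can be passed to filter, with the equivalent subset call next to it for comparison:

```julia
using DataFrames

# Small illustrative table (assumed data)
df = DataFrame(A = [0.1, 0.9, 0.4, 0.7], B = [1, 2, 3, 4])

# Hypothetical predicate: both conditions are joined with && inside one function.
# With a vector of columns, it receives one scalar argument per column.
both(a, b) = a <= 0.5 && b > 1

# filter with the cols => predicate style
r_filter = filter([:A, :B] => both, df)

# Equivalent subset call, using ByRow to apply the same predicate row by row
r_subset = subset(df, [:A, :B] => ByRow(both))
```

Both calls should select the same rows. Which one is faster on real data is best checked with a benchmark on your own table.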
