ByRow subset vs filter performance

hdavid16 · August 19, 2022, 4:39pm

I just compared subset (using a :col => ByRow(==(0)) operation) vs filter (using a :col => ==(0) operation) on a DataFrame. In both cases I set view=true, and I see significantly fewer allocations when using filter than when using subset. Would you agree that in these cases filter is more performant than subset?

Thanks!

Edit, here is a MWE:

using DataFrames

df = DataFrame(A = rand(10))
f1(df) = @time subset(df, :A => ByRow(<=(0.5)), view=true)
f2(df) = @time filter(:A => <=(0.5), df, view=true)

f1(df);
  0.000091 seconds (162 allocations: 8.391 KiB)

f2(df);
  0.000011 seconds (8 allocations: 320 bytes)

digital_carver · August 19, 2022, 5:29pm

A more typical benchmark using @btime seems to support this result too (with logical indexing leading to slightly better performance than filter despite having more allocations):

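julia> using BenchmarkTools, DataFrames

julia> begin
         df = DataFrame(A = rand(10000))
         f1(df) = subset(df, :A => ByRow(<=(0.5)), view = true)   # subset with ByRow
         f2(df) = filter(:A => <=(0.5), df, view = true)          # filter with a row-wise predicate
         f3(df) = subset(df, :A => a -> a .<= 0.5, view = true)   # subset with a broadcasted column function
         f4(df) = @view df[df.A .<= 0.5, :]                       # logical indexing
         @btime f1(df)
         @btime f2(df)
         @btime f3(df)
         @btime f4(df)
       end;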
  602.734 ms (464204 allocations: 8.92 MiB)
  379.465 ms (298631 allocations: 4.98 MiB)
  924.066 ms (640950 allocations: 11.75 MiB)
  372.004 ms (328600 allocations: 6.05 MiB)

bkamins · August 19, 2022, 6:46pm

This is expected in this case.

jar1 · August 19, 2022, 7:55pm

Could you explain why filter should be faster than subset?

bkamins · August 19, 2022, 8:06pm

In short: filter accepts only one condition and works row-wise, so its internal logic is much simpler. subset allows passing multiple conditions and works on whole columns. (If you want more details, it is best to check the source code to see the differences in implementation.)
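
To illustrate the difference in calling convention, here is a minimal sketch. The small df and its values are made up for illustration, and this only shows what each function passes to the user-supplied function, not how either is implemented internally:

```julia
using DataFrames

# Made-up data, only for illustration
df = DataFrame(A = [0.1, 0.7, 0.3, 0.9])

# filter: the predicate is called once per row and receives a scalar
r1 = filter(:A => a -> a <= 0.5, df)

# subset: the function receives the whole column and must return a Bool vector
r2 = subset(df, :A => a -> a .<= 0.5)

# ByRow wraps a scalar function so that subset applies it element by element
r3 = subset(df, :A => ByRow(a -> a <= 0.5))

# All three select the same rows
r1 == r2 == r3
```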

hdavid16 · August 19, 2022, 8:13pm

Thanks for pointing this out @bkamins. In general, could we say that if we are doing row-wise filtering, even on multiple conditions (joined with && operators), we should expect filter to outperform subset?

bkamins · August 19, 2022, 8:39pm

Yes, but you need to hardcode && in the predicate. Of course, this assumes you use the filter(cols => predicate, df) style.
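
A minimal sketch of what that could look like, assuming a hypothetical second column :B and made-up thresholds (neither is from the MWE above):

```julia
using DataFrames

# :B and all values here are hypothetical, only for illustration
df = DataFrame(A = [0.1, 0.7, 0.3, 0.9], B = [1, 2, 3, 4])

# filter: a single predicate over both columns, conditions joined with &&
v1 = filter([:A, :B] => (a, b) -> a <= 0.5 && b > 1, df, view = true)

# subset: one ByRow condition per column, combined by subset itself
v2 = subset(df, :A => ByRow(<=(0.5)), :B => ByRow(>(1)), view = true)

# Both return a view (SubDataFrame) over the same rows of df
v1 == v2
```

Whether the filter version is actually faster for a given table is best checked with a benchmark on your own data.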
