Is there a faster way to filter in dataframes?

vtomar · August 28, 2021, 8:26pm

Is there a faster way to filter than the method below when filtering across multiple columns of a dataframe?

using DataFrames, BenchmarkTools
test_df = DataFrame(a = rand(10000), b = rand(10000), 
               c = rand(10000), d = rand(10000))
@btime filter(AsTable([:a, :b]) => ( @. x -> !ismissing(x.a) & 
        (x.b > 0.5) & (0.25 <= x.a <= 0.75) ), test_df )

pdeffebach · August 28, 2021, 9:25pm

I can’t think of an obviously better way, no. Though the @. before the function is weird, I didn’t realize that syntax works.

Is it slow compared to other languages? If so, it would be interesting to look into it more and really push for performance.

EDIT: The use of @btime here might not be the most reliable, since you are constructing an anonymous function inside the call. Maybe wrap in another function and see how it goes?
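A minimal sketch of that suggestion, assuming the same test_df as in the original post. The names keep_row and filter_rows are hypothetical, introduced only for illustration. The predicate is defined once, the filter call is wrapped in a function, and the global is interpolated with $ so that the benchmark does not also measure access to an untyped global:

```julia
using DataFrames, BenchmarkTools

test_df = DataFrame(a = rand(10_000), b = rand(10_000),
                    c = rand(10_000), d = rand(10_000))

# Named row predicate (hypothetical name), defined once instead of
# being created as an anonymous function inside the benchmarked call.
keep_row(x) = !ismissing(x.a) & (x.b > 0.5) & (0.25 <= x.a <= 0.75)

# Wrapper (hypothetical name) so the benchmark times only the filtering.
filter_rows(df) = filter(AsTable([:a, :b]) => keep_row, df)

# `$` interpolates the global into the benchmark expression.
@btime filter_rows($test_df);
```

Whether this changes the reported timing noticeably will depend on the machine and package versions.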

jling · August 28, 2021, 9:36pm

Passing view = true will make it faster if the use case is read-only, since the result is then a SubDataFrame rather than a copy of the selected rows.
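A short sketch of the view keyword, using a smaller two-column frame and a simpler predicate than the original post (both are assumptions made to keep the example brief):

```julia
using DataFrames

test_df = DataFrame(a = rand(10_000), b = rand(10_000))

# With view = true, filter returns a SubDataFrame that refers to
# rows of the parent instead of copying them.
sub = filter(:b => >(0.5), test_df; view = true)

sub isa SubDataFrame        # true
parent(sub) === test_df     # true: the view points back at test_df
```

Because the result shares memory with test_df, writing into sub also changes test_df, which is why this only suits read-only use.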

DataFrames · August 28, 2021, 11:05pm

subset(test_df, AsTable([:a, :b]) => ( @. x -> !ismissing(x.a) &
               (x.b > 0.5) & (0.25 <= x.a <= 0.75)))

vtomar · August 28, 2021, 11:15pm

subset isn’t as fast as filter.

DataFrames · August 28, 2021, 11:35pm

Allow missing values and it will be:

allowmissing!(test_df)
@btime filter(AsTable([:a, :b]) => ( @. x -> !ismissing(x.a) &
               (x.b > 0.5) & (0.25 <= x.a <= 0.75) ), test_df )
  2.177 ms (40077 allocations: 2.25 MiB)
@btime subset(test_df, AsTable([:a, :b]) => ( @. x -> !ismissing(x.a) &
               (x.b > 0.5) & (0.25 <= x.a <= 0.75) ) )
  118.203 μs (241 allocations: 127.20 KiB)
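For context, a small sketch of what allowmissing! changes. It uses a tiny hypothetical frame rather than test_df. The column element type is widened in place, so every column can hold missing even if none is present yet:

```julia
using DataFrames

df = DataFrame(a = rand(5))

eltype(df.a)        # Float64

# Widen every column in place so that it may contain missing values.
allowmissing!(df)

eltype(df.a)        # Union{Missing, Float64}
```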

pdeffebach · August 29, 2021, 12:40am

It is with ByRow instead of broadcasting. I think both filter and ByRow might enable some sort of multi-threading that isn’t being picked up with the broadcasting?
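For reference, a sketch of the two subset forms being contrasted here, again on a smaller two-column frame (an assumption for brevity). In the broadcast form the function is called once with whole columns; in the ByRow form it is called once per row:

```julia
using DataFrames

test_df = DataFrame(a = rand(10_000), b = rand(10_000))

# Broadcast form: x is a NamedTuple of column vectors,
# so every operation is dotted and the result is a Bool vector.
by_broadcast = subset(test_df,
    AsTable([:a, :b]) => (x -> .!ismissing.(x.a) .& (x.b .> 0.5) .&
                               (0.25 .<= x.a .<= 0.75)))

# ByRow form: x is a NamedTuple of scalars for a single row.
by_row = subset(test_df,
    AsTable([:a, :b]) => ByRow(x -> !ismissing(x.a) & (x.b > 0.5) &
                                    (0.25 <= x.a <= 0.75)))

by_broadcast == by_row      # both select the same rows
```

The @. x -> ... spelling used earlier in the thread produces the dotted version automatically, which is why it works with subset.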

xinchin · August 29, 2021, 12:56am

Slightly off-topic: how can I do this for typed tables?

pdeffebach · August 29, 2021, 1:46am

If you add another 0 to the length of test_df they should be the same. So I think you are seeing the complicated internal logic of subset compared to filter. This is a fixed cost and won’t matter for bigger data frames.

vtomar · August 29, 2021, 5:19am

Yes, you are right. I am observing the same.
