Is there a faster way to filter in dataframes

Is there a faster way to filter than the method below when applying filtering across multiple columns of a dataframe.

using DataFrames, BenchmarkTools
test_df = DataFrame(a = rand(10000), b = rand(10000), 
               c = rand(10000), d = rand(10000))
@btime filter(AsTable([:a, :b]) => ( @. x -> !ismissing(x.a) & 
        (x.b > 0.5) & (0.25 <= x.a <= 0.75) ), test_df )

I can’t think of an obviously better way, no. Though the @. before the function is weird, I didn’t realize that syntax works.

Is it slow compared to other languages? If so it would be interesting to look in more to really push for performance.

EDIT: The use of @btime here might not be the most reliable, since you are constructing an anonymous function inside the call. Maybe wrap in another function and see how it goes?

1 Like

;view = true will make it faster if the use case is read-only

2 Likes
subset(test_df, AsTable([:a, :b]) => ( @. x -> !ismissing(x.a) &
               (x.b > 0.5) & (0.25 <= x.a <= 0.75)))

subset isn’t as fast as filter.

allow missing and it will be

allowmissing!(test_df)
 @btime filter(AsTable([:a, :b]) => ( @. x -> !ismissing(x.a) &
               (x.b > 0.5) & (0.25 <= x.a <= 0.75) ), test_df )
  2.177 ms (40077 allocations: 2.25 MiB)
@btime subset(test_df, AsTable([:a, :b]) => ( @. x -> !ismissing(x.a) &
               (x.b > 0.5) & (0.25 <= x.a <= 0.75) ) )
  118.203 μs (241 allocations: 127.20 KiB)
3 Likes

It is with ByRow instead of broadcasting. I think both filter and ByRow might enable some sort of multi-threading that isn’t being picked up with the broadcasting?

just offtopic, how I can do this for typed tables?

If you add another 0 to the length of test_data they should be the same. So I think you are seeing the complicated internal logic of subset compared to filter. This is a fixed cost and won’t matter for bigger data frames.

1 Like

Yes you are right. I am observing the same.