What is the recommended way to filter rows of a DataFrame?

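To recover the performance of filter, use the cols => fun form, and don’t broadcast your conditions, since filter already operates row-wise. The row-function form (filter(r -> ..., df)) is slow because a DataFrameRow does not carry the column types, so every field access is dynamically dispatched and allocates; with cols => fun the columns are passed through a function barrier and the predicate is compiled for their concrete types: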

julia> using DataFrames, Chairmarks

julia> df = DataFrame(a = rand(1:10, 1_000_000); b = rand(Bool, 1_000_000));

julia> @b df[df.a .== 2 .&& df.b, :]
507.200 μs (30 allocs: 956.469 KiB)

julia> @b filter(r -> r.a == 2 && r.b, $df)
58.145 ms (2199647 allocs: 34.498 MiB)

julia> @b subset($df, :a => ByRow(==(2)), :b)
599.200 μs (192 allocs: 1.894 MiB)

julia> @b filter([:a, :b] => ((a, b) -> a == 2 && b), $df)
507.400 μs (27 allocs: 956.453 KiB)

julia> function foo1(data)
           cond(x, y) = x .== 2 .&& y

           data[cond(data.a, data.b), :]
       end
foo1 (generic function with 1 method)

julia> @b foo1($df)
504.900 μs (28 allocs: 956.406 KiB)
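
As a sanity check, here is a minimal sketch on a tiny made-up table (the name small and its values are my own, not part of the benchmark above) suggesting that the four approaches select the same rows. It assumes Julia 1.7 or later for .&& and a reasonably recent DataFrames.jl:

```julia
using DataFrames

# Hypothetical five-row table, only for checking equivalence (not for timing).
small = DataFrame(a = [2, 3, 2, 2, 5], b = [true, true, false, true, false])

# Broadcasted Boolean mask with plain indexing.
by_index = small[small.a .== 2 .&& small.b, :]

# Row-function form: convenient, but slow on large tables.
by_row = filter(r -> r.a == 2 && r.b, small)

# subset: the condition on :a is wrapped in ByRow, :b is used directly as a Bool column.
by_subset = subset(small, :a => ByRow(==(2)), :b)

# cols => fun form: the function receives one scalar per listed column for each row.
by_cols = filter([:a, :b] => ((a, b) -> a == 2 && b), small)

# All four results should contain the same rows (rows 1 and 4 of small).
@assert by_index == by_row == by_subset == by_cols
```

Note that filter and subset treat missing values differently from plain indexing, so this check only speaks for columns without missing.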