What is the recommended way to filter rows of a Dataframe?

Hello,

I am asking for recommendations/tips on how to filter rows of a dataframe. I was using filter for sometime. But it seems inefficient/slow compared to subsetting dataframe using other methods. Here are benchmarks with three different ways of filtering a smallish dataframe of size (26760, 14). Do you prefer/recommend one of these ways, or something else entirely?

Thank you very much.

julia> import DataFramesMeta as Dfm

julia> dfsub1 = @btime Dfm.@subset(df1, :col8 .== 2, :col9 .== true);
  178.280 ÎĽs (385 allocations: 401.98 KiB)

julia> dfsub2 = @btime df1[(df1.col8 .== 2) .& (df1.col9 .== true), :];
  51.515 ÎĽs (87 allocations: 371.91 KiB)
  
julia> dfsub3 = @btime filter(row -> row.col8 .== 2 && row.col9 .== true, df1);
16.746 ms (134454 allocations: 3.23 MiB)

When I use DataFrames, I usually prioritize readability over speed (as long as there are no huge differences).

Considering this, I usually switch between the following options.

using DataFrames, DataFramesMeta, BenchmarkTools

dff = DataFrame(x = rand(100), y = rand(100), z = repeat([1,2], 50))


# fastest (as far as I know)
  # it solves the inherent type unstability of dataframes
function foo1()
    cond(x,y) = @.( (x > 0.5) && (y > 0.5) )
    
    dff[cond(dff.x, dff.y), :]
end
@btime foo1()  # 1.580 ÎĽs (14 allocations: 1.86 KiB)

# what I usually use (prioritize readability)
function foo2()
    cond = @with dff @.( (:x > 0.5) && (:y > 0.5) )
    
    dff[cond, :]
end

@btime foo2()  # 3.462 ÎĽs (29 allocations: 2.44 KiB)

In my experiencie, any other method provides a similar speed or is slower. I’d avoid filter for dataframes, because as far as a I remember it’s type unstable by construction.

The DataFrames documentation recommends using subset:

│ Note that as the subset function works in exactly the same way as other transformation functions defined in DataFrames.jl this is the preferred way to subset rows of a data frame or grouped data frame. In particular it uses a different set of rules for specifying transformations than filter which is implemented in DataFrames.jl to ensure support for the standard Julia API for collections.

It looks like the DataFramesMeta.@subset macro calls DataFrames.subset, so there shouldn’t be any difference there.

2 Likes

To recover the performance of filter use the cols => fun version and don’t broadcast your conditions, as filter operates row-wise already:

julia> using DataFrames, Chairmarks

julia> df = DataFrame(a = rand(1:10, 1_000_000); b = rand(Bool, 1_000_000));

julia> @b df[df.a .== 2 .&& df.b, :]
507.200 ÎĽs (30 allocs: 956.469 KiB)

julia> @b filter(r -> r.a == 2 && r.b, $df)
58.145 ms (2199647 allocs: 34.498 MiB)

julia> @b subset($df, :a => ByRow(==(2)), :b)
599.200 ÎĽs (192 allocs: 1.894 MiB)

julia> @b filter([:a, :b] => ((a, b) -> a == 2 && b), $df)
507.400 ÎĽs (27 allocs: 956.453 KiB)

julia> function foo1(data)
           cond(x, y) = x .== 2 .&& y

           df[cond(data.a, data.b), :]
       end
foo1 (generic function with 1 method)

julia> @b foo1($df)
504.900 ÎĽs (28 allocs: 956.406 KiB)
4 Likes

Thank you all. I learned new things from your answers. I cannot decide whether to accept the response by @alfaromartino, or by @nilshg as the solution. I will choose the latter because I will most likely use the foo1 example from there.

Can you please explain what is the significance of using $df (instead of just df) in these examples?

This interpolates the variable into the expression to ensure it is not treated as a global variable in the benchmark. See here in the BenchmarkTools docs (same holds for Chairmarks afaik):

1 Like

The “recommended” way for DataFramesMeta would be @rsubset, which is the same syntax as your first example but without the broadcasting dots.

(You may also find useful the summary of methods I posted here as part of a discussion for proposing better syntax for this.)

2 Likes