I am asking for recommendations/tips on how to filter rows of a dataframe. I was using filter for sometime. But it seems inefficient/slow compared to subsetting dataframe using other methods. Here are benchmarks with three different ways of filtering a smallish dataframe of size (26760, 14). Do you prefer/recommend one of these ways, or something else entirely?
When I use DataFrames, I usually prioritize readability over speed (as long as there are no huge differences).
Considering this, I usually switch between the following options.
using DataFrames, DataFramesMeta, BenchmarkTools
dff = DataFrame(x = rand(100), y = rand(100), z = repeat([1,2], 50))
# fastest (as far as I know)
# it solves the inherent type unstability of dataframes
function foo1()
cond(x,y) = @.( (x > 0.5) && (y > 0.5) )
dff[cond(dff.x, dff.y), :]
end
@btime foo1() # 1.580 ÎĽs (14 allocations: 1.86 KiB)
# what I usually use (prioritize readability)
function foo2()
cond = @with dff @.( (:x > 0.5) && (:y > 0.5) )
dff[cond, :]
end
@btime foo2() # 3.462 ÎĽs (29 allocations: 2.44 KiB)
In my experiencie, any other method provides a similar speed or is slower. I’d avoid filter for dataframes, because as far as a I remember it’s type unstable by construction.
The DataFrames documentation recommends using subset:
│ Note that as the subset function works in exactly the same way as other transformation functions defined in DataFrames.jl this is the preferred way to subset rows of a data frame or grouped data frame. In particular it uses a different set of rules for specifying transformations than filter which is implemented in DataFrames.jl to ensure support for the standard Julia API for collections.
It looks like the DataFramesMeta.@subset macro calls DataFrames.subset, so there shouldn’t be any difference there.
Thank you all. I learned new things from your answers. I cannot decide whether to accept the response by @alfaromartino, or by @nilshg as the solution. I will choose the latter because I will most likely use the foo1 example from there.
Can you please explain what is the significance of using $df (instead of just df) in these examples?
This interpolates the variable into the expression to ensure it is not treated as a global variable in the benchmark. See here in the BenchmarkTools docs (same holds for Chairmarks afaik):