Why does "filter" use an uncommon position for the input dataframe?

Juan · November 5, 2021, 1:14pm

df = DataFrame(x = [1.0,2.0,3.1, 44], y = 4.0:7.0, z = [1,1,2,2])

filter(:y => >=(3), df) # df must be written at the end.

While other functions use the first position.
subset(df, :y => x -> x .>= 5)
select(df, [:x, :y])
…

jling · November 5, 2021, 1:15pm

I assume it’s so you can use:
https://docs.julialang.org/en/v1/manual/functions/#Do-Block-Syntax-for-Function-Arguments
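As a minimal sketch of what that link describes, using only Base Julia (the vector `xs` and the threshold values are made up for illustration): because the function is the first argument of `filter`, a `do` block can supply it.

```julia
# Hypothetical sample data, not taken from the thread.
xs = [1.0, 2.0, 3.1, 44.0]

# `do` syntax: the block becomes the FIRST argument of filter.
kept = filter(xs) do v
    v > 2 && v < 40
end

# The same call written with an explicit anonymous function.
same = filter(v -> v > 2 && v < 40, xs)
```

Both calls keep only the elements for which the predicate returns true.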

bkamins · November 5, 2021, 1:28pm

Because this is the API for filter in Julia Base:

help?> filter
search: filter filter! fieldtype fieldtypes

  filter(f, a)

  Return a copy of collection a, removing elements for which f is false. The function f is passed one argument.

Henrique_Becker · November 5, 2021, 2:17pm

I have to say that I am not a big fan of this choice. That method of Base.filter has its arguments in that order because of the style guideline for methods that take a Function as a parameter ("put the Function argument at the start so people can use the do syntax with it"). However, the same does not work for a Pair of a Symbol (or Vector{Symbol}) and a Function. I think the most common reason for me to rerun a cell is that I wrote a filter call on a DataFrame with the arguments in the wrong order (i.e., I put the df first, as I find natural). I would really have preferred that DataFrames.jl stuck to the style guideline instead of mimicking the API of the other Base.filter methods while using parameters that defeat the reason those methods have their parameters in that order.
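To illustrate the point, here is a small sketch assuming DataFrames.jl is installed and reusing the data frame from the first post (the threshold 6 is arbitrary). The `do` syntax applies when a bare function is the first argument, but not when the function is wrapped inside a Pair.

```julia
using DataFrames

df = DataFrame(x = [1.0, 2.0, 3.1, 44], y = 4.0:7.0, z = [1, 1, 2, 2])

# Bare function first: `do` syntax works, and each `row` is a DataFrameRow.
by_row = filter(df) do row
    row.y >= 6
end

# Pair form: the function sits inside `:y => ...`, so a `do` block
# cannot supply it, yet the data frame still has to come last.
by_pair = filter(:y => >=(6), df)
```

Both calls select the same rows; only the first benefits from the argument order.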

bkamins · November 5, 2021, 2:33pm

filter in DataFrames.jl is a legacy function. Currently subset is the recommended function to use, and it is consistent with the “style guideline” you mention.

We have discussed whether we should remove filter from DataFrames.jl, but we thought that it would break too much legacy code. Do you think deprecating it and removing it in the 2.0 release would be preferable?

Henrique_Becker · November 5, 2021, 2:59pm

Oh, yes, unfortunately, removing filter would surely break loads of code now. I was merely criticizing the choices of the past. Fortunately, for this kind of analysis, I always keep a manifest inside a git repository and can go back to the specific version used, so removing filter would only be a problem if I needed to upgrade the packages for the code.

I tend to be in favor of improving the API, even if this means breaking code, but only if it is reasonable to believe that previous users have a way to keep using the original version (which seems to be the case in Julia, with the Manifest.toml file). I am not a DataFrames.jl maintainer, however (I think I made a single small contribution in the past), so I will not be living with the consequences of this choice.

bkamins · November 5, 2021, 3:24pm

The problem is that filter in DataFrames.jl was implemented and designed before select etc. were even invented.

Anyway, the conclusion is that in new code it is recommended to use subset for a consistent API. I will probably make a PR to the documentation of DataFrames.jl to highlight more clearly that filter is an old-style function and subset is preferred for consistency (@nalimilan predicted that we would have this very conversation once the new “style guideline” gets wider adoption).
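For comparison, a short sketch of the two styles side by side, assuming DataFrames.jl is installed and reusing the data frame from the first post. Note the difference in what the function receives: filter passes one value of the column at a time, while subset passes the whole column unless the function is wrapped in ByRow.

```julia
using DataFrames

df = DataFrame(x = [1.0, 2.0, 3.1, 44], y = 4.0:7.0, z = [1, 1, 2, 2])

# Legacy style: data frame last; the predicate sees one value of :y at a time.
old_style = filter(:y => >=(5), df)

# Recommended style: data frame first; the function receives the whole
# column, so the comparison has to be broadcast.
new_style = subset(df, :y => y -> y .>= 5)

# ByRow wraps a scalar predicate so that it is applied element by element.
new_style_byrow = subset(df, :y => ByRow(>=(5)))
```

All three calls return the rows where y is at least 5.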

Juan · November 5, 2021, 3:47pm

And why do we need to write these verbose lines:
xx[xx[!,:x] .> 3 ,:]
or
subset(xx, :x => t -> t .> 3.0)

instead of just
df[:x .> 3 ,:]
or
subset(xx, :x .> 3.0)
?

nilshg · November 5, 2021, 4:13pm

You might be looking for xx[xx.x .> 3, :] or DataFramesMeta which does something like your second suggestion.

bkamins · November 5, 2021, 4:58pm

Or, to explain it more verbosely: in Julia, writing :x .> 3 is an error, as you can see here:

julia> :x .> 3
ERROR: MethodError: no method matching isless(::Int64, ::Symbol)

Therefore your code is invalid. However, what you ask for is a natural wish, and this is exactly what DataFramesMeta.jl does: it resolves symbols as column names of a data frame.
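A brief sketch of both suggestions, assuming DataFrames.jl and DataFramesMeta.jl are installed (the data frame `xx` here is made up to match the column name used above):

```julia
using DataFrames, DataFramesMeta

# Hypothetical data frame standing in for `xx` from the question.
xx = DataFrame(x = [1.0, 2.0, 3.1, 44], y = 4.0:7.0)

# Plain DataFrames.jl: build the Boolean mask from the column itself.
by_mask = xx[xx.x .> 3, :]

# DataFramesMeta.jl: inside the macro, :x is resolved to the column xx.x.
by_macro = @subset(xx, :x .> 3)

# Row-wise variant of the macro, so no broadcasting dot is needed.
by_row_macro = @rsubset(xx, :x > 3)
```

The macros rewrite the expression before it runs, which is why `:x .> 3` is accepted there even though it is an error in ordinary Julia code.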
