Why does "filter" use an uncommon position for the input dataframe?

df = DataFrame(x = [1.0,2.0,3.1, 44], y = 4.0:7.0, z = [1,1,2,2])

filter(:y => >=(3), df) # df must be written at the end.

Other functions take the data frame as the first argument:
subset(df, :y => x -> x .>= 5)
select(df, [:x, :y])

I assume it’s so you can use:
https://docs.julialang.org/en/v1/manual/functions/#Do-Block-Syntax-for-Function-Arguments
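
As a rough sketch of what that enables (assuming DataFrames.jl is installed, and reusing the df from the question): with the function in the first position, the predicate can be written as a do block. In this form the predicate receives each row as a DataFrameRow.

```julia
using DataFrames

df = DataFrame(x = [1.0, 2.0, 3.1, 44], y = 4.0:7.0, z = [1, 1, 2, 2])

# The do block becomes the first argument of filter, i.e. filter(f, df).
# `row` is a DataFrameRow, so columns are accessed as properties.
result = filter(df) do row
    row.y >= 5
end
```

Note that the do block only covers the plain-function form; it does not produce the `:y => f` Pair form used in the question.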

Because this is the API for filter in Julia Base:

help?> filter
search: filter filter! fieldtype fieldtypes

  filter(f, a)

  Return a copy of collection a, removing elements for which f is false. The function f is passed one argument.

I have to say that I am not a big fan of this choice. That method of Base.filter has its arguments in that order because of the style guideline for methods that take a Function as a parameter ("Put the Function argument at the start so people can use the do syntax with it."). However, the same does not work for a Pair of a Symbol (or Vector{Symbol}) and a Function. I think the most common reason for me to rerun a cell is that I wrote the filter for a DataFrame with the arguments in the wrong order (i.e., I put the df first, which I find natural). I would really have preferred that DataFrames.jl stuck to the style guideline instead of mimicking the API of the other Base.filter methods while using parameters that go against the reason those methods have their parameters in that order.

filter in DataFrames.jl is a legacy function. Currently subset is the recommended function to use, and it is consistent with the “style guideline” you mention.

We have discussed whether we should remove filter from DataFrames.jl, but we thought that it would break too much legacy code. Do you think deprecating it and removing it in the 2.0 release would be preferable?

Oh, yes, unfortunately, removing filter is sure to break loads of code now. I was merely criticizing the choices of the past. Fortunately, for this kind of analysis, I always keep a manifest inside a git repository and can go back to the specific version used, so removing filter would only be a problem if I needed to upgrade the packages for the code.

I tend to favor improving the API, even if this means breaking code, but only if it is reasonable to believe that previous users have a way to keep using the original version (which seems to be the case in Julia, with the Manifest.toml file). I am not a DataFrames.jl maintainer, however (I think I made a single small contribution in the past), so I will not be living with the consequences of this choice.

The problem is that filter in DataFrames.jl was implemented and designed before select etc. were even invented.

Anyway, the conclusion is that in new code it is recommended to use subset for a consistent API. I will probably make a PR to the documentation of DataFrames.jl to highlight more clearly that filter is an old-style function and subset is preferred for consistency (@nalimilan predicted that we would have this very conversation once the new “style guideline” gained wider adoption).
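
To make the difference concrete, here is a small sketch (assuming DataFrames.jl is installed, and reusing the df from the question) that selects the same rows both ways. With the Pair form, filter applies the predicate to one value of the column at a time, while subset passes it the whole column, so the predicate has to broadcast or be wrapped in ByRow.

```julia
using DataFrames

df = DataFrame(x = [1.0, 2.0, 3.1, 44], y = 4.0:7.0, z = [1, 1, 2, 2])

# filter: predicate first, data frame last; >=(5) is applied to each value of :y.
a = filter(:y => >=(5), df)

# subset: data frame first; the function receives the whole :y column,
# so the comparison is broadcast with `.>=`.
b = subset(df, :y => y -> y .>= 5)

# ByRow lets subset reuse the same scalar predicate that filter takes.
c = subset(df, :y => ByRow(>=(5)))
```

All three calls keep the rows where y is at least 5.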

And why do we need to write these verbose lines:
xx[xx[!,:x] .> 3 ,:]
or
subset(xx, :x => t -> t .> 3.0)

instead of just
xx[:x .> 3, :]
or
subset(xx, :x .> 3.0)
?

You might be looking for xx[xx.x .> 3, :] or DataFramesMeta which does something like your second suggestion.

Or, to explain it more verbosely: in Julia, writing :x .> 3 is an error, as you can see here:

julia> :x .> 3
ERROR: MethodError: no method matching isless(::Int64, ::Symbol)

Therefore your code is invalid. However, what you ask for is a natural wish, and this is exactly what DataFramesMeta.jl does: it resolves symbols as column names of a data frame.
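
A minimal sketch of both suggestions, assuming a recent DataFramesMeta.jl is installed alongside DataFrames.jl (xx here is a small made-up data frame, not the one from the question): plain indexing with xx.x, and the @subset and @rsubset macros, which let you write the column as a bare symbol.

```julia
using DataFrames, DataFramesMeta

# Hypothetical example data, just for illustration.
xx = DataFrame(x = [1.0, 2.0, 3.1, 44])

# Plain DataFrames.jl: index rows with a Boolean vector built from the column.
by_index = xx[xx.x .> 3, :]

# DataFramesMeta.jl: :x inside the macro refers to the column xx.x.
by_macro = @subset(xx, :x .> 3)

# @rsubset works row by row, so no broadcasting dot is needed.
by_row = @rsubset(xx, :x > 3)
```

All three keep the rows where x is greater than 3.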
