df = DataFrame(x = [1.0,2.0,3.1, 44], y = 4.0:7.0, z = [1,1,2,2])
filter(:y => >=(3), df) # df must be written at the end.
Other functions, however, take df in the first position:
subset(df, :y => x -> x .>= 5)
select(df, [:x, :y])
…
I assume it’s so you can use the do-block syntax:
https://docs.julialang.org/en/v1/manual/functions/#Do-Block-Syntax-for-Function-Arguments
Because this is the API for filter in Julia Base:
help?> filter
search: filter filter! fieldtype fieldtypes
filter(f, a)
Return a copy of collection a, removing elements for which f is false. The function f is passed one argument.
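To make the do-block point concrete, here is a minimal sketch that uses only Base. The vector ys and the result names are made up for illustration; they do not come from the thread.

```julia
# Example data (an assumption for illustration).
ys = [4.0, 5.0, 6.0, 7.0]

# Ordinary call: predicate first, collection second.
kept_plain = filter(v -> v >= 5, ys)

# do-block form: the block becomes the first argument of filter,
# which is only possible because the function comes first.
kept_do = filter(ys) do v
    v >= 5
end
```

Both calls build the same filtered copy; the do block is just another way of writing the function that goes in the first position.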
I have to say that I am not a big fan of this choice. That method of Base.filter has its arguments in that order because of the style guideline for methods that take a Function as a parameter (“Put the Function argument at the start so people can use the do syntax with it.”). However, the do syntax does not work when the first argument is a Pair of a Symbol (or Vector{Symbol}) and a Function. I think the most common reason for me to rerun a cell is that I called filter on a DataFrame with the arguments in the wrong order (i.e., I put the df first, as I find natural). I would really have preferred that DataFrames.jl had stuck to the style guideline instead of trying to mimic the API of the other Base.filter methods while using parameters that defeat the very reason those methods have their parameters in that order.
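A small sketch of the difference, assuming DataFrames.jl is installed. The data frame mirrors the one at the top of the thread, and the names rows_do and rows_pair are only illustrative.

```julia
using DataFrames

df = DataFrame(x = [1.0, 2.0, 3.1, 44.0], y = 4.0:7.0, z = [1, 1, 2, 2])

# Row-function form: the function is the first argument, so do syntax applies.
# The function receives one row of df at a time.
rows_do = filter(df) do row
    row.y >= 5
end

# Pair form: the first argument is the Pair :y => function, so there is no
# function in first position for a do block to fill.
rows_pair = filter(:y => >=(5), df)
```

Both calls keep the same rows here; only the first form can take advantage of the function-first argument order.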
filter in DataFrames.jl is a legacy function. Currently, subset is the recommended function to use, and it is consistent with the “style guideline” you mention.
We have discussed whether we should remove filter from DataFrames.jl, but we thought that it would break too much legacy code. Do you think deprecating it and removing it in the 2.0 release would be preferable?
Oh, yes, unfortunately, removing filter is sure to break loads of code now. I was merely criticizing the choices of the past. Fortunately, for this kind of analysis, I always keep a manifest inside a git repository and can go back to the specific versions used, so removing filter would only be a problem if I needed to upgrade the packages for the code.
I tend to favor improving the API, even if this means breaking code, but only if it is reasonable to believe that existing users have a way to keep using the original version (which seems to be the case in Julia, with the Manifest.toml file). I am not a DataFrames.jl maintainer, however (I think I made a single small contribution in the past), so I will not be living with the consequences of this choice.
The problem is that filter in DataFrames.jl was implemented and designed before select etc. were even invented.
Anyway - the conclusion is that in new code it is recommended to use subset for a consistent API. I will probably make a PR to the DataFrames.jl documentation to highlight more clearly that filter is an old-style function and that subset is preferred for consistency (@nalimilan predicted that we would have this very conversation once the new “style guideline” gained wider adoption).
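For reference, a minimal side-by-side sketch of the two styles, assuming DataFrames.jl is installed. The data frame is the one from the top of the thread; the result names are illustrative.

```julia
using DataFrames

df = DataFrame(x = [1.0, 2.0, 3.1, 44.0], y = 4.0:7.0, z = [1, 1, 2, 2])

# filter: Pair first, data frame last; the predicate sees one value at a time.
by_filter = filter(:y => >=(5), df)

# subset: data frame first; the function sees the whole column,
# hence the broadcasted comparison.
by_subset = subset(df, :y => y -> y .>= 5)

# subset with ByRow: data frame first, predicate applied element by element.
by_subset_rowwise = subset(df, :y => ByRow(>=(5)))
```

All three select the same rows of this data frame, so which one to use is mostly a matter of argument order and of whether the predicate works on whole columns or on single values.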
And why do we need to write these verbose lines:
xx[xx[!,:x] .> 3 ,:]
or
subset(xx, :x => t -> t .> 3.0)
instead of just
xx[:x .> 3, :]
or
subset(xx, :x .> 3.0)
?
You might be looking for xx[xx.x .> 3, :] or DataFramesMeta.jl, which does something like your second suggestion.
Or, to explain it in more detail: in Julia, writing :x .> 3 is an error, as you can see here:
julia> :x .> 3
ERROR: MethodError: no method matching isless(::Int64, ::Symbol)
Therefore your code is invalid. However, what you ask for is a natural wish, and this is exactly what DataFramesMeta.jl does: it resolves symbols as column names of a data frame.
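A short sketch of what that looks like in practice, assuming both DataFrames.jl and DataFramesMeta.jl are installed. The data frame xx is a made-up stand-in for the one in the question.

```julia
using DataFrames, DataFramesMeta

# Stand-in data (an assumption for illustration).
xx = DataFrame(x = [1.0, 2.0, 3.1, 44.0])

# Plain DataFrames.jl: index with a Bool mask built from the column itself.
by_index = xx[xx.x .> 3, :]

# DataFramesMeta.jl: inside the macro, :x refers to column x of xx.
by_macro = @subset(xx, :x .> 3)

# Row-wise variant of the macro, so no broadcasting dot is needed.
by_rmacro = @rsubset(xx, :x > 3)
```

The macros rewrite the expression before it runs, which is why :x can stand for the column there even though the same expression outside a macro is a MethodError.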